Abstract

We consider the problem of learning an unknown product distribution $X$ over $\{0,1\}^n$ using samples $f(X)$ where $f$ is a known transformation function. Each choice of a transformation function $f$ specifies a learning problem in this framework.

Information-theoretic arguments show that for every transformation function $f$ the corresponding learning problem can be solved to accuracy $\epsilon$, using $\tilde{O}(n/\epsilon^2)$ examples, by a generic algorithm whose running time may be exponential in $n$. We show that this learning problem can be computationally intractable even for constant $\epsilon$ and rather simple transformation functions. Moreover, the above sample complexity bound is nearly optimal for the general problem, as we give a simple explicit linear transformation function $f(x) = w \cdot x$ with integer weights $w_i \le n$ and prove that the corresponding learning problem requires $\Omega(n)$ samples.

As our main positive result we give a highly efficient algorithm for learning a sum of independent unknown Bernoulli random variables, corresponding to the transformation function $f(x) = \sum_{i=1}^n x_i$. Our algorithm learns to $\epsilon$-accuracy in $\mathrm{poly}(n)$ time, using a surprising $\mathrm{poly}(1/\epsilon)$ number of samples that is independent of $n$. We also give an efficient algorithm that uses $\log n \cdot \mathrm{poly}(1/\epsilon)$ samples but has running time that is only $\mathrm{poly}(\log n, 1/\epsilon)$.

1 Introduction

We consider the problem of learning an unknown product distribution that has been transformed according to a known function $f$. This is a simple and natural learning problem, but one which does not seem to have been explicitly studied in a systematic way from a computational learning theory perspective.

More precisely, in this paper we restrict our model to the natural case when the input distribution is a product distribution over the Boolean cube $\{0,1\}^n$. In this learning scenario the learner is provided with samples from the random variable $f(X)$, where $X = (X_1, \ldots, X_n)$ is a vector of independent 0/1 Bernoulli random variables $X_i$ whose expectations are unknown to the learner. We write $p = (p_1, \ldots, p_n) \in [0,1]^n$ to denote $\mathbf{E}[X]$, and refer to $p$ as the target vector of probabilities; we shall sometimes write $f(p)$ to denote the random variable $f(X)$ described above. Using these samples, the learner must with probability $1 - \delta$ output a hypothesis distribution $\mathcal{H}$ over $f(\{0,1\}^n)$ such that the total variation distance $d_{TV}(f(p), \mathcal{H})$ is at most $\epsilon$. A proper learning algorithm in this framework outputs a hypothesis vector $\hat{p} \in [0,1]^n$ defining a hypothesis distribution $f(\hat{X})$, where $\hat{X} = (\hat{X}_1, \ldots, \hat{X}_n)$ is a vector of independent 0/1 Bernoulli random variables $\hat{X}_i$ whose expectation is $\mathbf{E}[\hat{X}] = \hat{p}$.

We emphasize that in this learning scenario, the transformation function $f$ is fixed and known to the learner; the choice of a particular transformation function $f$ specifies a particular learning problem in this model, much as the choice of a concept class $C$ specifies a learning problem in Valiant's PAC learning model. We will be interested in both the computational complexity (running time) and sample complexity (number of samples required) for algorithms that solve this problem, for different transformation functions $f$.

1.1 Motivation, examples, and connection to prior work

Our motivation for considering this model is twofold. First, we feel that it is so simple and natural as to warrant study for its own sake. Second, we believe that it offers a useful perspective on modeling probability distributions in settings where the underlying source of randomness is not directly accessible to the learner. In many settings we may wish to understand some phenomenon (in the physical world, in a market, etc.) where the available observations can be viewed as the output of a transformation $f$ applied to some underlying random source $X$; learning an accurate approximation of the distribution of $f(X)$ is a natural goal in such a setting. (The restriction on $X$ imposed in this paper, that it is a product of independent Bernoulli random variables, admittedly represents an idealized scenario, but it is a natural starting point for theoretical study.) It is plausible that in such a situation the transformation function $f$ may be well understood (as a consequence of our knowledge of the laws governing the physical world, the marketplace, etc.), but that much less is known about the parameters of the underlying random variable $X$ (we may have no direct access to this random variable, it could represent private information, etc.). This corresponds to our model's assumption that $f$ is "known" and the task is to infer the parameters of $X$ that give rise to the observed data.

Examples: As a simple example to illustrate our learning model, we consider the product distribution learning problem for $f$ where $f$ is any read-once AND-gate function. A function $f : \{0,1\}^n \to \{0,1\}^m$, $f(x) = (f_1(x), \ldots, f_m(x))$ is a read-once AND-gate function if each $f_i(x)$ is an AND over some subset $S_i \subseteq [n]$ of the $n$ input bits $x_1, \ldots, x_n$ and the sets $S_1, \ldots, S_m$ are pairwise disjoint. It is not hard to see that there is a straightforward proper learning algorithm based on linear programming that succeeds for any read-once AND-gate function regardless of the fanin of the AND gates (see Appendix A):

Observation 1.1 Let $f : \{0,1\}^n \to \{0,1\}^m$ be any fixed read-once AND-gate function (known to the learner). There is an algorithm that uses $\mathrm{poly}(n, 1/\epsilon)$ samples from the target distribution $f(p)$, runs in $\mathrm{poly}(n, 1/\epsilon)$ time, and with probability at least 9/10 outputs a hypothesis vector $\hat{p}$ such that $d_{TV}(f(p), f(\hat{p})) \le \epsilon$.
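
To see informally why read-once AND-gate functions are benign, note that disjoint gates produce independent output bits, and gate $j$ outputs 1 with probability $\prod_{i \in S_j} p_i$. The sketch below is not the linear-programming algorithm of Appendix A; it is a simpler moment-matching heuristic, assumed here purely for illustration, that picks a hypothesis vector whose gate means equal the empirical ones. The function names `sample_and_gates` and `fit_hypothesis` are hypothetical, and no accuracy analysis is claimed.

```python
import random

def sample_and_gates(p, gates, rng):
    """Draw one sample of f(X): one AND output per disjoint index set."""
    x = [1 if rng.random() < pi else 0 for pi in p]
    return tuple(int(all(x[i] for i in s)) for s in gates)

def fit_hypothesis(samples, gates, n):
    """Pick p_hat so that each gate's output mean matches its empirical mean."""
    p_hat = [0.5] * n  # inputs that feed no gate do not affect f(X)
    for j, s in enumerate(gates):
        q = sum(z[j] for z in samples) / len(samples)
        for i in s:
            # Equal shares: the product of p_hat over the gate equals q.
            p_hat[i] = q ** (1.0 / len(s))
    return p_hat

rng = random.Random(0)
gates = [(0, 1), (2,), (3, 4, 5)]          # pairwise disjoint index sets
p = [0.9, 0.8, 0.3, 0.7, 0.6, 0.5]         # target vector (unknown in practice)
samples = [sample_and_gates(p, gates, rng) for _ in range(20000)]
p_hat = fit_hypothesis(samples, gates, len(p))
```

The hypothesis generally differs from the target coordinate by coordinate, which is fine for a proper learner: only the distribution of $f(\hat{p})$ has to be close to that of $f(p)$.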

As a second example, we point out that the transformed product distribution learning model is broad enough to encompass the problem of learning an unknown mixture of $k$ product distributions over $\{0,1\}^n$ that was considered by Freund and Mansour (1999), Cryan et al. (2002), Feldman et al. (2008). For simplicity we describe the case $k = 2$: there are unknown product distributions $p, q$ over $\{0,1\}^n$ and unknown mixing weights $\pi_p$, $\pi_q = 1 - \pi_p$. The learner is given independent draws from the mixture distribution (each draw is independently taken from $p$ with probability $\pi_p$ and from $q$ with probability $\pi_q$), and must output hypothesis product distributions $\hat{p}, \hat{q}$ and hypothesis mixing weights $\hat{\pi}_p, \hat{\pi}_q$. This problem is easily seen to be equivalent to the transformed product distribution learning problem for the function $f : \{0,1\}^{2n+1} \to \{0,1\}^n$ which is such that on input $(z, x_1, \ldots, x_n, y_1, \ldots, y_n) \in \{0,1\}^{2n+1}$ the $i$-th bit of $f$'s output is $z x_i + (1 - z) y_i$. It is easy to see that if the target vector of probabilities for $f$ is $(\pi_p, p_1, \ldots, p_n, q_1, \ldots, q_n)$ then samples of $f$ are distributed exactly according to the mixture of product distributions, and finding a good hypothesis vector in $[0,1]^{2n+1}$ amounts to finding a hypothesis mixing weight and hypothesis product distributions $\hat{p}, \hat{q}$ as required in the original "learning mixtures of product distributions" problem.
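
The reduction above can be sketched directly in code. The snippet below is a minimal illustration, with hypothetical function names, of the transformation $f(z, x, y)$ whose $i$-th output bit is $z x_i + (1 - z) y_i$, applied to a product distribution with means $(\pi_p, p_1, \ldots, p_n, q_1, \ldots, q_n)$. The components are chosen to be deterministic so that every draw visibly comes from one of the two mixture components.

```python
import random

def mixture_transform(bits):
    """f(z, x_1..x_n, y_1..y_n): output bit i is z*x_i + (1-z)*y_i."""
    n = (len(bits) - 1) // 2
    z, x, y = bits[0], bits[1:n + 1], bits[n + 1:]
    return tuple(z * xi + (1 - z) * yi for xi, yi in zip(x, y))

def sample_transformed(target, rng):
    """Draw X from the product distribution with means `target`, return f(X)."""
    bits = [1 if rng.random() < t else 0 for t in target]
    return mixture_transform(bits)

rng = random.Random(1)
pi_p = 0.25
p = [1.0, 1.0, 0.0]   # first component always emits (1, 1, 0)
q = [0.0, 0.0, 1.0]   # second component always emits (0, 0, 1)
target = [pi_p] + p + q
draws = [sample_transformed(target, rng) for _ in range(10000)]
```

Here the selector bit $z$ plays the role of the hidden mixture label: roughly a $\pi_p$ fraction of the draws equal `(1, 1, 0)` and the rest equal `(0, 0, 1)`.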

Connection to prior work: The transformed product distribution learning model is related to the PAC-style model of learning discrete probability distributions that was introduced by Kearns et al. (1994) and studied in several subsequent works of Naor (1996), Ambainis et al. (1997), Farach and Kannan (1999), Freund and Mansour (1999), Cryan et al. (2002), Feldman et al. (2008). In the Kearns et al. (1994) framework a learning problem is defined by a class $\mathcal{C}$ of Boolean circuits, and an instance of the problem corresponds to the choice of a specific (unknown to the learner) target circuit $C \in \mathcal{C}$. The learner is given samples from $C(X)$ where $X$ is a uniform random string from $\{0,1\}^m$, and the learner must with high probability output a hypothesis circuit $C'$ such that the random variable $C'(X)$ is $\epsilon$-close to $C(X)$ (in KL-divergence).

Strictly speaking the transformed product distribution learning model may be viewed as a special case of the Kearns et al. model. This is done by considering a circuit class $\mathcal{C}$ that has a circuit $C = C_p$ for each possible product distribution $p$ over $\{0,1\}^n$; the circuit $C_p$ first transforms the uniform distribution over $\{0,1\}^m$ to the product distribution $p$ over $\{0,1\}^n$ and then applies the transformation function $f$ to the output of $p$. However, learning problems $\mathcal{C}$ of this sort do not seem to have been previously considered in the Kearns et al. (1994) model, and we feel it is more natural to view our model as dual in spirit to the earlier model. In Kearns et al. (1994) the learner's task is to infer an unknown transformation (the circuit $C$) into which are fed $n$-bit strings that are known to be distributed uniformly. In our case the transformation function $f$ is known to the learner but the underlying product distribution that is fed into $f$ is unknown and must be inferred.

1.2 Our results and techniques 

We establish a range of positive and negative results for this learning problem, both for general functions and for particular transformation functions of interest. For most of this paper we focus on the case in which $f(X)$ is simply a real-valued random variable, i.e. $f$ is a transformation mapping $\{0,1\}^n$ to $\mathbb{R}$.

We begin by considering the most general possible setting, in which the transformation function $f$ can be any function mapping the domain $\{0,1\}^n$ into any range. By an approach similar to the algorithm of Devroye and Lugosi (2001) for choosing a density estimate, we show (Theorem 6) that if the space $\{f(p)\}_{p \in [0,1]^n}$ of all $f$-transformed product distributions has an $\epsilon$-cover of size $N$, then there is a generic learning algorithm for the $f$-transformed product distribution problem that uses $O((\log N)/\epsilon^2)$ samples. The algorithm works by carrying out a tournament that matches every pair of distributions in the cover against each other; our analysis shows that with high probability some $\epsilon$-accurate distribution in the cover will survive the tournament undefeated, and that any undefeated distribution will with high probability be highly accurate.

As an immediate consequence of the general result Theorem 6 we get that for any transformation function $f$ there is an algorithm that learns to accuracy $\epsilon$ using $\tilde{O}(n/\epsilon^2)$ samples:

Theorem 1 (Information-theoretic upper bound for any $f$) Let $f : \{0,1\}^n \to \Omega$ be an arbitrary function where $\Omega$ is any range set. There is an algorithm that uses $O((n/\epsilon^2) \cdot \log(n/\epsilon))$ samples from the target distribution $f(p)$, runs in time $(n/\epsilon)^{O(n)}$, and with probability at least 9/10 outputs a hypothesis vector $\hat{p}$ such that $d_{TV}(f(p), f(\hat{p})) \le \epsilon$.

Since an $\epsilon$-cover of the space of all $f$-transformed product distributions may have size exponential in $n$, Theorem 1 does not in general provide a computationally efficient algorithm. Indeed, in Appendix C we show that the learning problem can be computationally hard even for rather simple transformation functions: using a reduction to the PARTITION problem, we prove:

Theorem 2 (NP-hardness) Suppose $\mathrm{NP} \not\subseteq \mathrm{BPP}$. Then there is an explicit degree-2 polynomial $f : \{0,1\}^n \to \mathbb{R}$ such that there is no polynomial-time algorithm that solves the transformed product distribution learning problem for $f$ to accuracy $\epsilon = 1/3$.

We also show that even for a simple linear transformation function $f(x) = w \cdot x$ with small integer weights, it can be impossible to significantly improve on the $\tilde{O}(n)$ sample complexity of the generic algorithm from Theorem 1. In Appendix D we prove:

Theorem 3 (Sample complexity lower bound) Fix any even $k \le n$ and let $f(x) = \sum_{i=k/2+1}^{k} i \, x_i$. Let $L$ be any learning algorithm that outputs a hypothesis vector $\hat{p}$ such that $d_{TV}(f(p), f(\hat{p})) \le 1/40$ with probability at least $e^{-o(k)}$. Then $L$ must use $\Omega(k)$ samples from $f(p)$.

These negative results provide strong motivation for considering what is perhaps the most natural of all transformation functions mapping $\{0,1\}^n$ to $\mathbb{R}$, the sum $f(x) = \sum_{i=1}^n x_i$; we refer to the corresponding learning problem as "learning an unknown sum of Bernoulli random variables." Our main contribution is a detailed study of this learning problem.

Learning sums of Bernoullis from constantly many samples. As our main result, we show that any sum of independent unknown Bernoulli random variables can be efficiently approximated to $\epsilon$-accuracy by a proper learning algorithm that uses $\mathrm{poly}(1/\epsilon)$ samples, independent of $n$. More precisely, we prove:

Theorem 4 (Learning sums of Bernoullis from constantly many samples) Let $f(x) = \sum_{i=1}^n x_i$. There is an algorithm that uses $\mathrm{poly}(1/\epsilon)$ samples from the target distribution $f(p)$, runs in time $n^3 \cdot \mathrm{poly}(1/\epsilon) + n \cdot (1/\epsilon)^{O(\log^2(1/\epsilon))}$, and with probability at least 9/10 outputs a hypothesis vector $\hat{p} \in [0,1]^n$ which is such that $d_{TV}(f(p), f(\hat{p})) \le \epsilon$.

It should be stressed that the generic algorithm of Theorem 6 requires $\Omega((1/\epsilon^2) \log n)$ samples for this learning problem (easy arguments give an $n^{\Omega(1)}$ lower bound on the size of any cover for sums of Bernoullis). We view this sample complexity independent of $n$ as a surprising result which may find subsequent applications. We next give a brief overview of the obstacles that appear and the techniques involved in our proof.

As a first step in the proof of Theorem 4 we observe that a simple learning algorithm using $O(1/\epsilon^2)$ samples gives a hypothesis which has error at most $\epsilon$ with respect to the Kolmogorov distance (see Section 2). While the algorithm itself is simple, its analysis relies on a fundamental result from probability theory, known as the Dvoretzky-Kiefer-Wolfowitz inequality (Dvoretzky et al. (1956)), which may be viewed as a special case of the fundamental Vapnik-Chervonenkis theorem (see Chapter 3 of Devroye and Lugosi (2001)). In Appendix E we give a self-contained proof of the DKW inequality using elementary techniques (martingales and the method of bounded differences) and an interesting trick that goes back to Kolmogorov (see Peres (2009)); this proof is significantly different from the proofs we know of in the probability literature (see Dvoretzky et al. (1956), Massart (1990), and Chapter 3 of Devroye and Lugosi (2001)).

A natural attempt is to use Kolmogorov approximation as a black box to obtain an approximation in total variation distance. In the special case when $X$ is a sum of Bernoulli random variables and $Y$ is a binomial distribution $B(n, p)$ (i.e. all the Bernoullis $Y_1, \ldots, Y_n$ have the same mean $p$), it is indeed possible to bound the total variation distance between $X$ and $Y$ as a function of their Kolmogorov distance. It can be shown that in this case the distributions of $X$ and $Y$ cross each other at most a constant number of times, and this is easily seen to imply that the two distances (total variation and Kolmogorov) are within a constant factor. This fact about crossings goes back to an argument by Newton (c.f. Hardy et al. (1934), section 2.22), establishing that the sequence $a_k = \Pr[X = k]/\Pr[Y = k]$ is log-concave if $Y$ is binomially distributed.
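
The log-concavity claim can be checked numerically on a small instance. The sketch below computes the exact distribution of a sum of independent Bernoullis by repeated convolution, divides it pointwise by a binomial distribution, and tests the inequality $a_k^2 \ge a_{k-1} a_{k+1}$. The particular means, the binomial parameter 0.4, and the floating-point slack are arbitrary choices made for illustration; the check is a sanity test on one example, not a proof.

```python
from math import comb

def pbd_pmf(ps):
    """pmf of a sum of independent Bernoullis, one convolution per variable."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)
            nxt[k + 1] += mass * p
        pmf = nxt
    return pmf

def binomial_pmf(n, q):
    """pmf of the binomial distribution B(n, q)."""
    return [comb(n, k) * q**k * (1 - q)**(n - k) for k in range(n + 1)]

def is_log_concave(a, slack=1e-9):
    """Check a_k^2 >= a_{k-1} * a_{k+1}, up to floating-point slack."""
    return all(a[k] ** 2 >= a[k - 1] * a[k + 1] * (1 - slack)
               for k in range(1, len(a) - 1))

ps = [0.1, 0.35, 0.5, 0.62, 0.8, 0.95]   # arbitrary Bernoulli means
x = pbd_pmf(ps)
y = binomial_pmf(len(ps), 0.4)
ratios = [xk / yk for xk, yk in zip(x, y)]  # the sequence a_k
```

A log-concave ratio sequence is unimodal, so $\Pr[X = k] - \Pr[Y = k]$ changes sign only a bounded number of times; this is the crossing property used in the paragraph above.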

Unfortunately, if $X$ and $Y$ are both generic sums of independent Bernoullis with arbitrary means (as in our setting), then such a bound is not known in the literature, and indeed it seems that essentially nothing is known about the relation between the two distances in this general case. Without such a bound, it is rather unclear whether a number of samples that does not scale with $n$ suffices to accurately learn in total variation distance. Nevertheless, we extend the Kolmogorov distance learning algorithm to total variation distance via a delicate algorithm that exploits the detailed structure of a small $\epsilon$-cover (Daskalakis and Papadimitriou (2011), Daskalakis (2008)) of the space of all distributions that are sums of independent Bernoulli random variables (see Theorem 9). Interestingly, this becomes feasible by establishing an analog of the aforementioned argument by Newton to a class of distributions used in the cover that are called heavy (see Section 4.2). This in turn relies on probabilistic approximation results via translated Poisson distributions (see Röllin (2007)).

Learning sums of Bernoullis in sublinear time. While the $\mathrm{poly}(1/\epsilon)$ sample complexity of Theorem 4 is essentially optimal¹, the running time is $\mathrm{poly}(n)$ (and superpolynomial in $1/\epsilon$). Moreover, generating a single sample from the hypothesis product distribution requires $O(n)$ uniformly random bits. In general, any proper learning algorithm that explicitly outputs a hypothesis vector $\hat{p} \in [0,1]^n$ will take $\Omega(n)$ running time, and generating a sample from the hypothesis $f(\hat{p})$ will require $\Omega(n)$ random bits.

More broadly, any algorithm (not necessarily proper) for learning sums of Bernoullis must have running time $\Omega((1/\epsilon^2) \cdot \log n)$ in the bit model (since each sample is an $\Omega(\log n)$ bit string and $\Omega(1/\epsilon^2)$ samples are needed). Generating a sample from an arbitrary hypothesis distribution will in general require $\Omega(\log n)$ bits (since the entropy of the binomial distribution is $\Omega(\log n)$). Hence a natural goal is to have a learning algorithm that runs in $\mathrm{poly}(\log n, 1/\epsilon)$ time and requires $O(\log n)$ bits of randomness to generate a draw from its hypothesis distribution. As discussed in the previous paragraph, such an algorithm needs to be non-proper.

We show that there is an algorithm that satisfies all of these efficiency considerations, at the cost of a log n 
factor in the sample complexity: 

Theorem 5 (Learning sums of Bernoullis in polylog($n$) time with efficient hypotheses) Let $f(x) = \sum_{i=1}^n x_i$. There is an algorithm that uses $\log(n) \cdot \mathrm{poly}(1/\epsilon)$ samples from the target distribution $f(p)$, performs $\log^2(n) \cdot \mathrm{poly}(1/\epsilon)$ bit operations, and with probability at least 9/10 outputs a (succinct representation of a) hypothesis distribution $\mathcal{H}$ over $\{0, 1, \ldots, n\}$ such that $d_{TV}(f(X), \mathcal{H}) \le \epsilon$. Moreover, a draw from the hypothesis distribution $\mathcal{H}$ can be obtained in $\mathrm{poly}(\log n, 1/\epsilon)$ time using $O(\log(n/\epsilon))$ bits of randomness.

The key to Theorem 5 is the simple observation that any sum of Bernoullis is a unimodal distribution over the domain $\{0, 1, \ldots, n\}$. This lets us apply a powerful algorithm due to Birgé (1997) that can learn any unimodal distribution to accuracy $\epsilon$ using $O(\log n)/\epsilon^3$ samples. The algorithm outputs a hypothesis distribution that is a histogram over $O((\log n)/\epsilon)$ intervals that cover $\{0, \ldots, n\}$: more precisely, the hypothesis is uniform within each interval, and for each interval the total mass it assigns to the interval is simply the fraction of samples that landed in that interval. Thus, the hypothesis distribution has a succinct description and can be efficiently evaluated using a small amount of randomness. We give details in Appendix F.
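
The shape of such a histogram hypothesis can be sketched as follows. This is only an illustration of the output format described above (uniform within each interval, interval mass equal to the empirical fraction); it does not implement Birgé's choice of intervals, which is what gives the accuracy guarantee. The partition `starts` below is a hypothetical placeholder supplied by hand, and the function names are assumptions.

```python
import random
from bisect import bisect_right

def fit_histogram(samples, starts, n):
    """starts: sorted left endpoints (starts[0] == 0) of intervals covering 0..n.
    Returns half-open intervals [lo, hi) and the empirical mass of each."""
    ends = starts[1:] + [n + 1]
    counts = [0] * len(starts)
    for s in samples:
        counts[bisect_right(starts, s) - 1] += 1
    masses = [c / len(samples) for c in counts]
    return list(zip(starts, ends)), masses

def pmf_at(intervals, masses, v):
    """Hypothesis probability of the point v (uniform within its interval)."""
    for (lo, hi), m in zip(intervals, masses):
        if lo <= v < hi:
            return m / (hi - lo)
    return 0.0

def draw(intervals, masses, rng):
    """Pick an interval according to its mass, then a uniform point inside it."""
    lo, hi = rng.choices(intervals, weights=masses, k=1)[0]
    return rng.randrange(lo, hi)

rng = random.Random(2)
n = 100
samples = [sum(rng.random() < 0.3 for _ in range(n)) for _ in range(2000)]
starts = [0, 10, 20, 25, 30, 35, 40, 50, 70]   # hypothetical partition
intervals, masses = fit_histogram(samples, starts, n)
value = draw(intervals, masses, rng)
```

The description consists of one number per interval, so its size depends on the number of intervals rather than on $n$, and a draw needs only an interval choice plus a uniform offset.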

We remark that by applying the algorithm of Theorem 5 to the hypothesis distribution that is provided by Theorem 4 one can obtain a $2\epsilon$-accurate hypothesis satisfying the efficiency conditions of Theorem 5 (i.e. the hypothesis can be evaluated in poly-logarithmic time and a sample from it can be generated using $O(\log(n/\epsilon))$ random bits). Thus, it is possible to construct an $\epsilon$-accurate efficient hypothesis using $\mathrm{poly}(1/\epsilon)$ samples and $\mathrm{poly}(n)$ time. Whether this can be achieved in poly-logarithmic time is an interesting and challenging open problem; we discuss this and other questions for future work in Section 6.

2 Preliminaries 

Recall that the total variation distance between two distributions $P$ and $Q$ over a finite domain $D$ is $d_{TV}(P, Q) := (1/2) \cdot \sum_{\alpha \in D} |P(\alpha) - Q(\alpha)|$. Similarly, if $X$ and $Y$ are two random variables ranging over a finite set, their total variation distance $d_{TV}(X, Y)$ is defined as the total variation distance between their distributions. Another notion of distance between distributions/random variables that we use is the Kolmogorov distance. For two distributions $P$ and $Q$ supported on $\mathbb{R}$, their Kolmogorov distance is $d_K(P, Q) := \sup_{x \in \mathbb{R}} |P((-\infty, x]) - Q((-\infty, x])|$. Similarly, if $X$ and $Y$ are two random variables ranging over a subset of $\mathbb{R}$, their Kolmogorov distance, denoted $d_K(X, Y)$, is the Kolmogorov distance between their distributions. If two distributions $P$ and $Q$ are supported on a finite subset of $\mathbb{R}$ we obtain immediately that $d_K(P, Q) \le 2 \cdot d_{TV}(P, Q)$.

¹ An easy reduction to distinguishing a fair coin from an $\epsilon$-biased coin shows that any learning algorithm for this problem needs $\Omega(1/\epsilon^2)$ samples.

Fix a finite domain $D$, and let $\mathcal{P}$ denote some set of distributions over $D$. Given $\delta > 0$, a subset $\mathcal{Q} \subseteq \mathcal{P}$ is said to be a $\delta$-cover of $\mathcal{P}$ (w.r.t. total variation distance) if for every $P$ in $\mathcal{P}$ there exists some $Q$ in $\mathcal{Q}$ such that $d_{TV}(P, Q) \le \delta$.

We write $\mathcal{S} = \mathcal{S}_n$ to denote the set of all product distributions over $\{0,1\}^n$, and $f(\mathcal{S})$ to denote the set of all transformed product distributions $\{f(p)\}_{p \in \mathcal{S}}$. We write $f(p)(\alpha)$ to denote the probability of outcome $\alpha$ under distribution $f(p)$. Finally, for $\ell \in \mathbb{Z}_+$ we write $[\ell]$ to denote $\{1, \ldots, \ell\}$.
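
For concreteness, the two distances defined in this section can be computed as in the sketch below, which assumes that distributions are represented as Python dictionaries mapping each support point to its probability (a representation chosen here for illustration). The example pair shows that the two distances can differ by a factor of two while still satisfying $d_K \le 2 \cdot d_{TV}$.

```python
def tv_distance(p, q):
    """Total variation distance between pmfs given as dicts value -> mass."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)

def kolmogorov_distance(p, q):
    """Largest gap between the two CDFs. For finitely supported pmfs the
    CDFs only change at support points, so it suffices to scan those."""
    gap = cp = cq = 0.0
    for a in sorted(set(p) | set(q)):
        cp += p.get(a, 0.0)
        cq += q.get(a, 0.0)
        gap = max(gap, abs(cp - cq))
    return gap

p = {0: 0.5, 1: 0.0, 2: 0.5}
q = {0: 0.0, 1: 1.0, 2: 0.0}
d_tv = tv_distance(p, q)          # the supports are effectively disjoint
d_k = kolmogorov_distance(p, q)   # the CDFs never differ by more than 1/2
```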

3 A generic algorithm for any $f$ and some lower bounds

In this section we give a simple generic algorithm to solve the transformed product distribution learning problem for any transformation function $f$, and state some lower bounds showing that even for rather simple functions $f$, the time and sample complexity of the generic algorithm may be essentially the best possible.

3.1 A generic algorithm 

The key ingredient in the generic algorithm is the following: 

Theorem 6 Fix a function $f : \{0,1\}^n \to \Omega$ where $\Omega$ is any range set. Suppose there exists a $\delta$-cover for $f(\mathcal{S})$ of size $N = N(n, \delta)$. Then there is an algorithm that uses $O(\delta^{-2} \log N)$ samples and solves the $f$-transformed product distribution learning problem to accuracy $6\delta$.

The high-level idea behind Theorem 6 is as follows: for a pair of distributions $Q_1, Q_2 \in \mathcal{S}$, we define a competition between $Q_1$ and $Q_2$ that takes as input a sample from the target distribution $f(X)$ and either crowns one of $Q_1, Q_2$ as the winner of the competition or calls the competition a draw. Let $\mathcal{Q} \subseteq \mathcal{S}$ be a $\delta$-cover for $f(\mathcal{S})$ of cardinality $N = N(n, \delta)$. The algorithm performs a tournament between every pair of distributions from $\mathcal{Q}$ and outputs a distribution $Q^* \in \mathcal{Q}$ that was never a loser, i.e. won or achieved a draw in all competitions. (If no such distribution exists, the algorithm outputs "failure.")
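
The tournament structure can be sketched as follows. The precise competition rule is given in Appendix B and is not reproduced here; the rule below is one plausible Scheffé-style instantiation, and its thresholds ($5\delta$ and $1.5\delta$) are assumptions made only so that the sketch is concrete. Distributions are represented as dictionaries from outcomes to probabilities, and the names `compete` and `tournament` are hypothetical.

```python
from itertools import combinations

def compete(q1, q2, samples, delta):
    """Return 0 if q1 wins, 1 if q2 wins, None for a draw (assumed rule)."""
    support = set(q1) | set(q2)
    # Outcomes to which q1 assigns strictly more mass than q2.
    w = {a for a in support if q1.get(a, 0.0) > q2.get(a, 0.0)}
    m1 = sum(q1.get(a, 0.0) for a in w)
    m2 = sum(q2.get(a, 0.0) for a in w)
    if m1 - m2 <= 5 * delta:          # candidates too close to separate
        return None
    tau = sum(1 for s in samples if s in w) / len(samples)
    if tau > m1 - 1.5 * delta:        # empirical mass of w agrees with q1
        return 0
    if tau < m2 + 1.5 * delta:        # empirical mass of w agrees with q2
        return 1
    return None

def tournament(cover, samples, delta):
    """Return a candidate that never lost, or None to signal failure."""
    lost = [False] * len(cover)
    for i, j in combinations(range(len(cover)), 2):
        result = compete(cover[i], cover[j], samples, delta)
        if result == 0:
            lost[j] = True
        elif result == 1:
            lost[i] = True
    for idx, q in enumerate(cover):
        if not lost[idx]:
            return q
    return None

cover = [{0: 0.9, 1: 0.1}, {0: 0.5, 1: 0.5}, {0: 0.1, 1: 0.9}]
samples = [0] * 90 + [1] * 10         # stand-in for draws from the target
winner = tournament(cover, samples, delta=0.05)
```

Swapping the two arguments of `compete` replaces the set `w` by the set on which the other candidate is larger and leads to the same verdict up to relabeling, in the spirit of the symmetric competition described in this section. Every pair is compared, so the running time is quadratic in the size of the cover.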

This basic approach of running a tournament between distributions in an $\epsilon$-cover is quite similar to the algorithm of Devroye and Lugosi for choosing a density estimate (see Devroye and Lugosi (1996a,b) and Chapters 6 and 7 of Devroye and Lugosi (2001)), which in turn built closely on the work of Yatracos (1985). Our algorithm achieves essentially the same bounds as these earlier approaches but there are some small differences. (The DL approach uses a notion of the "competition" between two distributions which is not symmetric under swapping the two competing distributions, whereas our competition is symmetric; also, the DL approach chooses a distribution which wins the maximum number of competitions as the output distribution, whereas our algorithm chooses a distribution that is never defeated.) We give our proof of Theorem 6 in Appendix B.

Theorem 1 is an easy consequence of Theorem 6. Recall Theorem 1:

Theorem 1 (Information-theoretic upper bound for any $f$) Let $f : \{0,1\}^n \to \Omega$ be an arbitrary function where $\Omega$ is any range set. There is an algorithm that uses $O((n/\epsilon^2) \cdot \log(n/\epsilon))$ samples from the target distribution $f(p)$, runs in time $(n/\epsilon)^{O(n)}$, and with probability at least 9/10 outputs a hypothesis vector $\hat{p}$ such that $d_{TV}(f(p), f(\hat{p})) \le \epsilon$.

Proof: We argue that for any $f$ there is a $\delta$-cover for $f(\mathcal{S})$ of size at most $(n/\delta)^n$. The desired result then follows from Theorem 6. Since for any pair of distributions $P, Q \in \mathcal{S}$ and any function $f$ we have $d_{TV}(f(P), f(Q)) \le d_{TV}(P, Q)$, it suffices to exhibit a $\delta$-cover of the desired cardinality for $\mathcal{S}$.

We claim that if we discretize each individual expectation of our input Bernoulli random variables to integer multiples of $\alpha := \delta/n$, we obtain a $\delta$-cover for $\mathcal{S}$. Let us call the set of all such discretized product distributions $\mathcal{Q}$. Clearly, $|\mathcal{Q}| \le (n/\delta)^n$. Let $P = (P_1, \ldots, P_n) \in \mathcal{S}$. Consider a point $Q = (Q_1, \ldots, Q_n) \in \mathcal{Q}$ such that $d_{TV}(Q_i, P_i) \le \delta/n$ for all $i$. Since both $P$ and $Q$ are product distributions, we have that $d_{TV}(P, Q) \le \sum_{i=1}^n d_{TV}(Q_i, P_i) \le \delta$. This completes the proof. ∎
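
The discretization step in this proof can be checked numerically on a small example. The sketch below rounds each mean to the nearest multiple of $\alpha = \delta/n$ and computes the exact total variation distance between the two product distributions by enumerating $\{0,1\}^n$, which is feasible only for small $n$; the function names and the test vector are illustrative assumptions. For the sample bound, note that taking $\delta = \epsilon/6$ in Theorem 6 gives $\log N \le n \log(6n/\epsilon)$, hence $O(\delta^{-2} \log N) = O((n/\epsilon^2) \log(n/\epsilon))$ samples.

```python
from itertools import product

def round_to_grid(p, delta):
    """Round each mean to the nearest integer multiple of alpha = delta / n."""
    alpha = delta / len(p)
    return [min(1.0, round(pi / alpha) * alpha) for pi in p]

def product_pmf(p):
    """Exact pmf of the product distribution over {0,1}^n (small n only)."""
    pmf = {}
    for x in product((0, 1), repeat=len(p)):
        mass = 1.0
        for xi, pi in zip(x, p):
            mass *= pi if xi else 1 - pi
        pmf[x] = mass
    return pmf

def tv(P, Q):
    """Total variation distance between two pmfs over the same keys."""
    return 0.5 * sum(abs(P[x] - Q[x]) for x in P)

p = [0.123, 0.456, 0.789, 0.5]
delta = 0.1
q = round_to_grid(p, delta)
distance = tv(product_pmf(p), product_pmf(q))
```

Rounding to the nearest grid point moves each coordinate by at most $\alpha/2$, so the bound in the proof holds with room to spare.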



3.2 Learning transformed product distributions can be computationally hard 

Though Theorem 1 shows that any learning problem $f$ in our framework can be solved with $\tilde{O}(n)$ sample complexity, it is natural to expect that some learning problems can be computationally hard. We confirm this intuition by establishing an NP-hardness result for a specific function $f$ that is computed by an explicit degree-2 polynomial. We show that if there is a $\mathrm{poly}(n)$-time algorithm for the transformed product distribution learning problem for this $f$, even for learning to constant accuracy, then $\mathrm{NP} \subseteq \mathrm{BPP}$. Recall Theorem 2:

Theorem 2 Suppose $\mathrm{NP} \not\subseteq \mathrm{BPP}$. Then there is an explicit degree-2 polynomial $f$ such that there is no polynomial-time algorithm that solves the transformed product distribution learning problem for $f$ to accuracy $\epsilon = 1/3$.

The proof is a reduction from the PARTITION problem and is given in Appendix C.

3.3 Linear transformation functions can require $\Omega(n)$ samples

We now show that even for a simple linear transformation function $f(x) = w \cdot x$ with small integer weights, it can be impossible to significantly improve on the $\tilde{O}(n)$ sample complexity of the generic algorithm from Theorem 1. Recall Theorem 3:

Theorem 3 (Sample complexity lower bound) Fix any even $k \le n$ and let $f(x) = \sum_{i=k/2+1}^{k} i \, x_i$. Let $L$ be any learning algorithm that outputs a hypothesis vector $\hat{p}$ such that $d_{TV}(f(p), f(\hat{p})) \le 1/40$ with probability at least $e^{-o(k)}$. Then $L$ must use $\Omega(k)$ samples from $f(p)$.

Theorem 3 is proved in Appendix D.

4 Learning an unknown sum of Bernoullis from $\mathrm{poly}(1/\epsilon)$ samples

4.1 Learning with respect to Kolmogorov distance

Let $X$ be any random variable supported on $\{0, 1, \ldots, n\}$. We write $F_X$ and $f_X$ to denote respectively the cumulative distribution and the probability density function of $X$.

Let $Z_1, \ldots, Z_k$ be independent samples of the random variable $X$, and define $Z_i^{\ell} := \mathbf{1}_{Z_i \le \ell}$ for all $\ell = 0, \ldots, n$ and $i = 1, \ldots, k$. Clearly we have $\mathbf{E}[\sum_i Z_i^{\ell}/k] = F_X(\ell)$, which suggests that $\hat{F}_X(\ell) := \sum_i Z_i^{\ell}/k$ may be a good estimator of $F_X(\ell)$, for all values of $\ell$, if $k$ is large enough. The Dvoretzky-Kiefer-Wolfowitz inequality (Dvoretzky et al. (1956), Massart (1990)) confirms this, and in fact shows that a surprisingly small value of $k$, independent of $n$, suffices. The bound on $k$ given below is optimal up to constant factors.

Theorem 7 (DKW Inequality) Let $k = \max\{576, (9/8) \ln(1/\delta)\} \cdot (1/\epsilon^2)$. Then with probability at least $1 - \delta$ we have $\max_{0 \le \ell \le n} |\hat{F}_X(\ell) - F_X(\ell)| \le \epsilon$.

In Appendix E we give a self-contained proof of the theorem using elementary techniques (martingales and the method of bounded differences) and an interesting trick that goes back to Kolmogorov (see Peres (2009)). We start by defining a coupling between the process of learning the cumulative distribution function as our samples are revealed and a random walk on the line. Then Kolmogorov's trick is invoked to get a handle on the maximum estimation error, proving a weaker version of the theorem in which a larger value of $k$ is required. We then apply McDiarmid's inequality to bootstrap the weaker bound and obtain the tighter bound.
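
The estimator and the sample size of Theorem 7 can be sketched as follows. The function names are hypothetical, the value of $k$ is rounded up to an integer, and the target (a sum of 20 fair coins) is an arbitrary choice made for illustration.

```python
import math
import random

def dkw_sample_size(eps, delta):
    """k from Theorem 7: max{576, (9/8) ln(1/delta)} / eps^2, rounded up."""
    return math.ceil(max(576.0, 9.0 / 8.0 * math.log(1.0 / delta)) / eps ** 2)

def empirical_cdf(samples, n):
    """F_hat(l) = fraction of samples that are <= l, for l = 0..n."""
    counts = [0] * (n + 1)
    for z in samples:
        counts[z] += 1
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / len(samples))
    return cdf

rng = random.Random(3)
n, eps, delta = 20, 0.5, 0.1
k = dkw_sample_size(eps, delta)   # does not depend on n
samples = [sum(rng.random() < 0.5 for _ in range(n)) for _ in range(k)]
f_hat = empirical_cdf(samples, n)
```

The point of the theorem is visible in `dkw_sample_size`: the support size $n$ does not appear among its arguments.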

Now we specialize to the case in which $X = \sum_{i=1}^n X_i$ is a sum of independent Bernoulli random variables. We use the DKW inequality to prove the following:

Theorem 8 (Proper Learning under Kolmogorov Distance) Let $X = \sum_{i=1}^n X_i$ be a sum of independent Bernoulli random variables. There is an algorithm which, given $k = \max\{9216, 18 \ln(1/\delta)\} \cdot (1/\epsilon^2)$ independent samples from $F_X$, produces with probability at least $1 - \delta$ a set of independent Bernoullis $Y_1, \ldots, Y_n$ such that $d_K(X, Y) \le \epsilon$, where $Y := \sum_i Y_i$. The running time of the algorithm is $\mathrm{poly}(n/\epsilon) + n \cdot (1/\epsilon)^{O(\log^2(1/\epsilon))}$.

Proof of Theorem 8: Use Theorem 7 to produce an $\epsilon/4$-approximation $\{\hat{F}_X(\ell)\}_{\ell}$ of the cumulative distribution of $X$. Theorem 9 below gives us that for all $\gamma > 0$, there exists a $\gamma$-cover in total variation distance of the set of all sums of $n$ Bernoulli random variables that has size $\mathrm{poly}(n/\gamma) + n \cdot (1/\gamma)^{O(\log^2(1/\gamma))}$. Construct such a cover using $\gamma = \epsilon/8$. Given that the Kolmogorov distance between two distributions is always at most twice their total variation distance, this cover is in fact an $\epsilon/4$-cover in the Kolmogorov distance. Output any $Y = \sum_{i=1}^n Y_i$ in the cover whose cumulative distribution $F_Y$ satisfies

$\max_{0 \le \ell \le n} |F_Y(\ell) - \hat{F}_X(\ell)| \le \epsilon/2$. (1)

It is easy to see that a $Y$ satisfying (1) exists in the cover. Indeed, if $Y$ is the closest point of the cover to $X$ in Kolmogorov distance, then it must be that $\max_{0 \le \ell \le n} |F_Y(\ell) - F_X(\ell)| \le \epsilon/4$. Given that $\{\hat{F}_X(\ell)\}_{\ell}$ is an $\epsilon/4$-approximation to $F_X$, the above inequality implies (1).

Moreover, it is easy to check that any $Y$ satisfying (1) will satisfy $\max_{0 \le \ell \le n} |F_Y(\ell) - F_X(\ell)| \le 3\epsilon/4 < \epsilon$, using again that $\{\hat{F}_X(\ell)\}_{\ell}$ is an $\epsilon/4$-approximation to $F_X$. Hence, we have $d_K(X, Y) \le \epsilon$. ∎
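
The selection step of this proof can be sketched as follows. The cover of Theorem 9 is not constructed here; a handful of binomial-like candidates stands in for it, and the "empirical" CDF is simply the exact CDF of a nearby sum of Bernoullis. Both are assumptions made so that the sketch is self-contained, and the function names are hypothetical.

```python
def pbd_cdf(ps):
    """CDF of a sum of independent Bernoullis with means ps."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for j, mass in enumerate(pmf):
            nxt[j] += mass * (1 - p)
            nxt[j + 1] += mass * p
        pmf = nxt
    cdf, running = [], 0.0
    for mass in pmf:
        running += mass
        cdf.append(running)
    return cdf

def select_from_cover(f_hat, cover, eps):
    """Return the first candidate whose CDF is within eps/2 of f_hat,
    i.e. the first candidate satisfying condition (1); None if there is none."""
    for ps in cover:
        f_y = pbd_cdf(ps)
        if max(abs(a - b) for a, b in zip(f_y, f_hat)) <= eps / 2:
            return ps
    return None

# Toy stand-in for the cover of Theorem 9: a few binomial-like candidates.
n = 4
cover = [[q] * n for q in (0.1, 0.3, 0.5, 0.7, 0.9)]
f_hat = pbd_cdf([0.45, 0.5, 0.55, 0.5])   # pretend this came from samples
chosen = select_from_cover(f_hat, cover, eps=0.1)
```

The scan is linear in the size of the cover, which is why the running time in Theorem 8 is dominated by the cover size.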

4.2 From Kolmogorov distance to total variation distance

The algorithm of the previous subsection learns the target sum of Bernoullis to high accuracy with respect to Kolmogorov distance. Ideally, we would like to use this approximation as a black box to obtain an approximation in total variation distance. As we discussed in the introduction, this runs into a very basic, apparently unresolved question in probability theory: is there a bound on the total variation distance between two sums of independent indicators in terms of their Kolmogorov distance? If at least one of the two sums is a Binomial distribution, then an argument due to Newton gives a positive answer. However, nothing is known about the relation between the two distances in the general case. Without such a bound, it is rather unclear whether constantly many samples (independent of $n$) suffice to accurately learn in total variation distance.

Nevertheless, we manage to extend our Kolmogorov distance learning algorithm to the total variation distance via a delicate algorithm that exploits the structure of a small $\epsilon$-cover (in total variation distance) of the space of all distributions that are sums of independent Bernoulli random variables. Interestingly, this becomes feasible by establishing an analog of the aforementioned argument by Newton to a class of distributions used in the cover that are called heavy; and this argument relies on probabilistic approximation results via translated Poisson distributions.

We give more details below. Let us start by formally stating a theorem that defines a cover (in total variation distance) of the space of sums of independent indicators. The following is Theorem 9 of the full version of (Daskalakis and Papadimitriou (2011)):

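Theorem 9 (Cover for sums of Bernoullis) For all $\epsilon > 0$, there exists a set $\mathcal{S}_\epsilon \subseteq \mathcal{S}$ such that (i) $|\mathcal{S}_\epsilon| \le n^3 \cdot O(1/\epsilon) + n \cdot (1/\epsilon)^{O(\log^2(1/\epsilon))}$; (ii) for every $\{X_i\}_i \in \mathcal{S}$ there exists some $\{Y_i\}_i \in \mathcal{S}_\epsilon$ such that $d_{TV}(\sum_i X_i, \sum_i Y_i) \le \epsilon$ (i.e. $f(\mathcal{S}_\epsilon)$ is an $\epsilon$-cover of $f(\mathcal{S})$); and (iii) the set $\mathcal{S}_\epsilon$ can be constructed in time $O(n^3 \cdot O(1/\epsilon) + n \cdot (1/\epsilon)^{O(\log^2(1/\epsilon))})$.

Moreover, if $\{Y_i\}_i \in \mathcal{S}_\epsilon$, then the collection $\{Y_i\}_i$ has one of the following forms, where $k = k(\epsilon) = O(1/\epsilon)$ is a positive integer:

• (Sparse Form) There is a value $\ell \le k^3 = O(1/\epsilon^3)$ such that for all $i \le \ell$ we have $\mathbf{E}[Y_i] \in \{1/k^2, 2/k^2, \ldots, (k^2 - 1)/k^2\}$, and for all $i > \ell$ we have $\mathbf{E}[Y_i] \in \{0, 1\}$.

• ($k$-heavy Binomial Form) There is a value $\ell \in \{0, 1, \ldots, n\}$ and a value $q \in \{1/(kn), 2/(kn), \ldots, (kn - 1)/(kn)\}$ such that for all $i \le \ell$ we have $\mathbf{E}[Y_i] = q$; for all $i > \ell$ we have $\mathbf{E}[Y_i] \in \{0, 1\}$; and $\ell, q$ satisfy the bounds $\ell q \ge k^2 - 1/k$ and $\ell q (1 - q) \ge k^2 - k - 1 - 3/k$.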

(We remark that Daskalakis (2008) establishes the same theorem, except that the size of the cover given there,
as well as the time needed to produce it, are $n^3 \cdot O(1/\varepsilon) + n \cdot (1/\varepsilon)^{O(1/\varepsilon^2)}$. Indeed, this weaker bound is obtained by
enumerating over all possible collections $\{Y_i\}_i$ in sparse form and all possible collections in k-heavy Binomial
form, for $k = O(1/\varepsilon)$ specified by the theorem.)

Using the cover described in Theorem 9 we prove the following, which immediately gives Theorem 4:

Theorem 10 (Learning under Total Variation Distance) LetX = Xi be a sum of independent Bernoullis. 
Fix any t > 0. There is an algorithm which, given 0(-^) independent samples from X, produces with proba- 
bility at least 9/10 a list of Bernoulli randomvariablesYi, ... ,Y„ such that djvi^^Y) < e, whereY :— Yi. 

The running time of the algorithm is $n^3 \cdot \mathrm{poly}(1/\varepsilon) + n \cdot (1/\varepsilon)^{O(\log^2 1/\varepsilon)}$.

We first give a high-level outline of our argument. The proof works by considering the points in a cover
$S_{\varepsilon^\beta}$, where $\beta$ is some constant > 1. We define two tests that can be performed on points (i.e. distributions) in
$S_{\varepsilon^\beta}$. The first of these, called the A-test, is run on every sparse form distribution in $S_{\varepsilon^\beta}$, and is designed to
identify a sparse form distribution that is close to X if such a distribution exists. The second test, called the
H-test, is run on every k-heavy Binomial form distribution in $S_{\varepsilon^\beta}$ and is designed to identify a Binomial form
distribution that is close to X if such a distribution exists. Since $S_{\varepsilon^\beta}$ is a cover, some test will succeed. (Of
course, we must also show that for each test, any distribution it outputs is indeed legitimately close to X; this
is part of our analysis as well.)

We now enter into the detailed proof. Let yS = 1 -H ^ and let a = 4 H- ^. Using Theorem]?] from 0{e'^°'') 
independent samples of X we can obtain estimates $\{\hat F_X(\ell)\}_{\ell=0}^{n}$ such that $|\hat F_X(\ell) - F_X(\ell)| \le \varepsilon^\alpha$ for all $\ell$, with
probability at least 9/10. For the rest of the proof we condition on the event that each of our estimates $\hat F_X(\ell)$
is indeed within $\varepsilon^\alpha$ of the actual value $F_X(\ell)$. Define $\hat f_X(0) = \hat F_X(0)$ and $\hat f_X(z) = \hat F_X(z) - \hat F_X(z-1)$ for all
$z \in [n]$.

4.2.1 Handling sparse form distributions in the cover.

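Let $Y = \sum_i Y_i$, where $\mathcal{Y} := \{Y_i\}_{i=1}^{n} \in S_{\varepsilon^\beta}$, and suppose that $\mathcal{Y}$ is of the sparse form, as defined in the statement
of Theorem 9. Let supp(Y) denote the support of Y. By the definition of the sparse form, we have that
$|\mathrm{supp}(Y)| \le (c\varepsilon)^{-3\beta}$, where c is some universal constant. Define $\Delta_Y$ as follows:

$$\Delta_Y = \frac{1}{2}\Big( \sum_{z \in \mathrm{supp}(Y)} |f_Y(z) - \hat f_X(z)| + 1 - \sum_{z \in \mathrm{supp}(Y)} |\hat f_X(z)| \Big).$$

We observe that given $\{\hat F_X(\ell)\}_{\ell=0}^{n}$ and $\mathcal{Y}$, the value of $\Delta_Y$ can be straightforwardly computed in time poly(1/ε)
using dynamic programming.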
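The dynamic program mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `sparse_form_pmf` and the `(shift, pmf)` return convention are assumptions made here. The point it shows is that variables with mean 0 or 1 only translate the distribution, so the work depends only on the number of non-trivial means.

```python
def sparse_form_pmf(means):
    """Return (shift, pmf) with Pr[Y = shift + j] = pmf[j], where Y is a sum of
    independent Bernoulli variables with the given means.

    Hypothetical helper: the name and return convention are illustrative only.
    """
    # Variables that are deterministically 1 just translate the distribution.
    shift = sum(1 for p in means if p == 1.0)
    pmf = [1.0]  # distribution of the empty sum
    for p in means:
        if p == 0.0 or p == 1.0:
            continue  # deterministic variables need no convolution step
        nxt = [0.0] * (len(pmf) + 1)
        for j, mass in enumerate(pmf):
            nxt[j] += mass * (1.0 - p)  # this variable equals 0
            nxt[j + 1] += mass * p      # this variable equals 1
        pmf = nxt
    return shift, pmf
```

With at most $k^3$ non-trivial means, the loop performs on the order of $k^6$ elementary updates, which is poly(1/ε) for $k = O(1/\varepsilon^\beta)$.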

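The following two claims (whose proofs we defer to Section 4.3) say that if $d_{TV}(X, Y)$ is large (at least ε)
then $\Delta_Y$ must be fairly large, while if $d_{TV}(X, Y)$ is small (at most $\varepsilon^\beta$) then $\Delta_Y$ must also be fairly small.

Claim 11 If $d_{TV}(X, Y) > \varepsilon$, then $\Delta_Y \ge \varepsilon - 2(c\varepsilon)^{-3\beta} \cdot \varepsilon^\alpha$.

Claim 12 If $d_{TV}(X, Y) \le \varepsilon^\beta$, then $\Delta_Y \le \varepsilon^\beta + 2(c\varepsilon)^{-3\beta} \cdot \varepsilon^\alpha$.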

By our choice of y6 = 1 + ^ and a = 4 + |, for e smaller than a certain constant (depending on c and r) 
the following condition holds:

$$\varepsilon > \varepsilon^\beta + 4c^{-3\beta}\varepsilon^{\alpha - 3\beta}. \qquad (2)$$

Claims 11 and 12 imply that if we use the cover $S_{\varepsilon^\beta}$ of Theorem 9 we can filter out collections $\{Y_i\}_i \in S_{\varepsilon^\beta}$ in
the sparse form whose sum Y is ε-far in total variation distance from X, by computing $\Delta_Y$ and thresholding at
the value $\varepsilon^\beta + 2(c\varepsilon)^{-3\beta} \cdot \varepsilon^\alpha$. Moreover, this filtration is not going to get rid of any collections in sparse form
that are within $\varepsilon^\beta$ total variation distance from the target distribution. Formally, let us define the following
test, which takes as input the estimates $\{\hat F_X(\ell)\}_{\ell=0}^{n}$ and decides whether or not to reject a point in $S_{\varepsilon^\beta}$ in sparse
form.

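Definition 13 (A-test) The input is $\mathcal{Y} = \{Y_i\}_i \in S_{\varepsilon^\beta}$ in the sparse form. Let $Y = \sum_i Y_i$. If $\Delta_Y \le \varepsilon^\beta + 2(c\varepsilon)^{-3\beta} \cdot \varepsilon^\alpha$
then the A-test accepts $\mathcal{Y}$, otherwise it rejects $\mathcal{Y}$.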
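As a rough sketch of the A-test, assuming the statistic is $\Delta_Y = \frac{1}{2}\big(\sum_{z \in \mathrm{supp}(Y)} |f_Y(z) - \hat f_X(z)| + 1 - \sum_{z \in \mathrm{supp}(Y)} |\hat f_X(z)|\big)$ and the threshold is $\varepsilon^\beta + 2(c\varepsilon)^{-3\beta}\varepsilon^\alpha$: the names `delta_statistic` and `a_test` and the dictionary representation of the pmf of Y are assumptions made for illustration. The constant `c` is unspecified in the text, so any value passed in is hypothetical.

```python
def delta_statistic(pmf_y, cdf_hat):
    """Compute Delta_Y from the pmf of Y (dict z -> Pr[Y = z]) and the list
    cdf_hat of estimated CDF values of X at 0, 1, ..., n."""
    def f_hat(z):
        # Estimated point mass of X at z, obtained by differencing the CDF estimates.
        return cdf_hat[z] - (cdf_hat[z - 1] if z > 0 else 0.0)

    support = [z for z, mass in pmf_y.items() if mass > 0.0]
    on_support = sum(abs(pmf_y[z] - f_hat(z)) for z in support)
    off_support = 1.0 - sum(abs(f_hat(z)) for z in support)
    return 0.5 * (on_support + off_support)


def a_test(pmf_y, cdf_hat, eps, alpha, beta, c):
    """Accept (return True) iff Delta_Y is at most the threshold of Definition 13."""
    threshold = eps ** beta + 2.0 * (c * eps) ** (-3.0 * beta) * eps ** alpha
    return delta_statistic(pmf_y, cdf_hat) <= threshold
```

When the estimates are exact, `delta_statistic` returns the total variation distance between X and Y, which is why thresholding it separates ε-far candidates from $\varepsilon^\beta$-close ones.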

Since $\Delta_Y$ can be computed in poly(1/ε) time, an execution of the A-test can be performed in poly(1/ε)
time. If (2) is satisfied, Claims 11 and 12 imply the following:

Lemma 14 (Correctness of the A-test) Let $\{Y_i\}_i \in S_{\varepsilon^\beta}$ be in the sparse form and $Y = \sum_i Y_i$. If Y is accepted
by the A-test then $d_{TV}(X, Y) \le \varepsilon$, and if $d_{TV}(X, Y) \le \varepsilon^\beta$ then Y is accepted by the A-test.

Lemma 14 implies in particular that if $\{X_i\}_i$ has an $\varepsilon^\beta$-neighbor $\{Y_i\}_i$ in $S_{\varepsilon^\beta}$ that is in the sparse form, then this
neighbor will be accepted by the A-test. Moreover, no element $\{Y_i\}_i \in S_{\varepsilon^\beta}$ in sparse form that is accepted by
the A-test has $\sum_i Y_i$ further than ε in total variation distance from $\sum_i X_i$.

4.2.2 Handling Binomial form distributions in the cover. 

We can use the A-test for the sparse points in the cover, but it could be that the target collection $\mathcal{X} = \{X_i\}_i$ has
no sparse $\varepsilon^\beta$-neighbor in $S_{\varepsilon^\beta}$ and the A-test fails to accept any sparse point in the cover. We need to devise a
procedure which similarly filters the points of heavy Binomial form in the cover so that we do not eliminate
any $\varepsilon^\beta$-close point, while at the same time not admitting any ε-far point. Since X has no sparse $\varepsilon^\beta$-neighbor
in the cover, it follows from Theorem 9 that there is a collection $\mathcal{X}' := \{X'_i\}_i \in S_{\varepsilon^\beta}$ in $k(\varepsilon^\beta)$-heavy Binomial
form such that $\sum_i X_i$ and $\sum_i X'_i$ are within $\varepsilon^\beta$ in total variation distance.

We show that the total variation distance is essentially within a constant factor of the Kolmogorov distance
for two collections of random variables in heavy Binomial form:

Lemma 15 Let $\mathcal{X} := \{X_i\}_i$ and $\mathcal{Y} := \{Y_i\}_i$ be two collections of independent indicators in k-heavy Binomial
form and set $X = \sum_i X_i$, $Y = \sum_i Y_i$. Then $\frac{1}{2} d_K(X, Y) \le d_{TV}(X, Y) \le 2 \cdot d_K(X, Y) + O(1/k)$.

The proof is somewhat lengthy so we defer it to Section 4.4. Given Lemma 15 we are inspired to define
the H-test as follows. The test takes as input the estimates $\{\hat F_X(\ell)\}_{\ell=0}^{n}$ and needs to decide whether or not to
reject a point in $S_{\varepsilon^\beta}$ in $k(\varepsilon^\beta)$-heavy Binomial form.

Definition 16 (H-test) The input is $\mathcal{Y} = \{Y_i\}_i \in S_{\varepsilon^\beta}$ in the $k(\varepsilon^\beta)$-heavy Binomial form. Let $Y = \sum_i Y_i$. If
$\max_{0 \le \ell \le n} |F_Y(\ell) - \hat F_X(\ell)| \le 2\varepsilon^\beta + \varepsilon^\alpha$ then the H-test accepts $\mathcal{Y}$, otherwise it rejects $\mathcal{Y}$.
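A minimal sketch of the H-test, assuming the acceptance rule $\max_\ell |F_Y(\ell) - \hat F_X(\ell)| \le 2\varepsilon^\beta + \varepsilon^\alpha$. The helper `heavy_binomial_cdf`, its parameter `ones` (the number of variables that are deterministically 1), and the function name `h_test` are hypothetical and introduced only for this illustration. The direct use of `math.comb` is suitable for small examples only; for large `ell` a numerically stable evaluation would be needed.

```python
import math


def heavy_binomial_cdf(n, ell, q, ones):
    """CDF values F_Y(0), ..., F_Y(n) for Y = ones + Bin(ell, q).

    Assumes ell + ones <= n; `ones` counts the deterministic-1 variables.
    """
    cdf, running = [], 0.0
    for z in range(n + 1):
        j = z - ones  # value taken by the Binomial part
        if 0 <= j <= ell:
            running += math.comb(ell, j) * q ** j * (1.0 - q) ** (ell - j)
        cdf.append(min(running, 1.0))
    return cdf


def h_test(cdf_y, cdf_hat, eps, alpha, beta):
    """Accept (return True) iff the largest CDF gap is at most 2*eps^beta + eps^alpha."""
    gap = max(abs(a - b) for a, b in zip(cdf_y, cdf_hat))
    return gap <= 2.0 * eps ** beta + eps ** alpha
```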

Like the A-test, the H-test can be performed in poly(1/ε) time. We now prove:

Lemma 17 (Correctness of the H-test) Suppose that X is not $\varepsilon^\beta$-close to any point in $S_{\varepsilon^\beta}$ of the sparse form.
Let $\mathcal{Y} = \{Y_i\}_i \in S_{\varepsilon^\beta}$ be of the $k(\varepsilon^\beta)$-heavy Binomial form, and let $Y = \sum_i Y_i$. If $\mathcal{Y}$ is accepted by the H-test,
then $d_{TV}(X, Y) \le 4\varepsilon^\alpha + O(\varepsilon^\beta)$. On the other hand, if $d_{TV}(X, Y) \le \varepsilon^\beta$, then $\mathcal{Y}$ is accepted by the H-test.

Proof of Lemma 17: Since X is not $\varepsilon^\beta$-close to any sparse point in the cover, it follows from Theorem 9
that there exists a collection $\mathcal{X}' := \{X'_i\}_i \in S_{\varepsilon^\beta}$ in the $k(\varepsilon^\beta)$-heavy Binomial form such that $X := \sum_i X_i$ and
$X' := \sum_i X'_i$ are within $\varepsilon^\beta$ in total variation distance.

Suppose that $\mathcal{Y}$ passes the H-test. For all $\ell$, we have $|F_Y(\ell) - F_X(\ell)| \le |F_Y(\ell) - \hat F_X(\ell)| + |\hat F_X(\ell) - F_X(\ell)|$,
and hence $d_K(X, Y) \le 2\varepsilon^\beta + 2\varepsilon^\alpha$. Given this, we have

$$\begin{aligned}
d_{TV}(X, Y) &\le d_{TV}(X, X') + d_{TV}(X', Y) && \text{(using the triangle inequality)} \\
&\le \varepsilon^\beta + d_{TV}(X', Y) \\
&\le \varepsilon^\beta + 2 d_K(X', Y) + O(1/k(\varepsilon^\beta)) && \text{(using Lemma 15)} \\
&\le 2 d_K(X', Y) + O(\varepsilon^\beta) && \text{(since Theorem 9 gives } 1/k(\varepsilon^\beta) = O(\varepsilon^\beta)) \\
&\le 2 d_K(X', X) + 2 d_K(X, Y) + O(\varepsilon^\beta) && \text{(using the triangle inequality)} \\
&\le 4 d_{TV}(X', X) + 2 d_K(X, Y) + O(\varepsilon^\beta) && \text{(using that } d_K \le 2 \cdot d_{TV}) \\
&\le 2 d_K(X, Y) + O(\varepsilon^\beta) \le 4\varepsilon^\alpha + O(\varepsilon^\beta).
\end{aligned}$$

On the other hand, if $d_{TV}(X, Y) \le \varepsilon^\beta$, it follows that $d_K(X, Y) \le 2\varepsilon^\beta$. Hence, for all $\ell$,

$$|F_Y(\ell) - \hat F_X(\ell)| \le |F_Y(\ell) - F_X(\ell)| + |F_X(\ell) - \hat F_X(\ell)| \le d_K(X, Y) + \varepsilon^\alpha \le 2 d_{TV}(X, Y) + \varepsilon^\alpha \le 2\varepsilon^\beta + \varepsilon^\alpha.$$

Hence, $\mathcal{Y}$ is accepted by the H-test. ■

4.2.3 Finishing the Proof of Theorem 10

Let c be the constant hidden in the O(·)-notation in the statement of Lemma 17. We may assume that ε is
smaller than any fixed constant, and hence that it satisfies $\varepsilon \ge 4\varepsilon^\alpha + c \cdot \varepsilon^\beta$ as well as (2). We now describe
the algorithm promised in Theorem 10. The algorithm takes as input the estimates $\{\hat F_X(\ell)\}_{\ell=0}^{n}$ (which, as
described at the beginning of the proof, can be obtained from the samples from X using Theorem 7).

Algorithm

1. Compute the cover $S_{\varepsilon^\beta}$ defined in Theorem 9.

2. If any $\mathcal{Y} \in S_{\varepsilon^\beta}$ in the sparse form passes the A-test, output such a $\mathcal{Y}$ and halt.

3. Otherwise, if any $\mathcal{Y} \in S_{\varepsilon^\beta}$ in the $k(\varepsilon^\beta)$-heavy Binomial form passes the H-test, output such a $\mathcal{Y}$.

It follows from Theorem 9 that there exists some $\{Y_i\}_i \in S_{\varepsilon^\beta}$ such that $\sum_i Y_i$ is within $\varepsilon^\beta$ in total variation
distance from $\sum_i X_i$. If there exists such an element in the cover that is also in the sparse form, it follows from
Lemma 14 that this point will pass the test at the second step of the algorithm and hence be returned in the
output. On the other hand, any element $\{Y_i\}_i$ of the cover returned by the second step of the algorithm will
satisfy that $\sum_i Y_i$ is within ε in total variation distance from $\sum_i X_i$. If the second step of the algorithm fails
to return any element of the cover, it follows that X has an $\varepsilon^\beta$-neighbor in the cover in heavy Binomial form.
Lemma 17 then implies that such a neighbor will be output in the third step of the algorithm. Moreover, the
lemma implies that any element returned in the third step is an ε-neighbor of X. Hence the algorithm is correct
and always succeeds in returning an ε-neighbor of X. Finally, the running time is dominated by the time to run
the A-test or the H-test on each point in the cover $S_{\varepsilon^\beta}$, which is easily seen to be $n^3 \cdot \mathrm{poly}(1/\varepsilon) + n \cdot (1/\varepsilon)^{O(\log^2 1/\varepsilon)}$.
This concludes the proof of Theorem 10 and thus also of Theorem 4. ■

4.3 Proof of Claims 11 and 12

Recall that $Y = \sum_i Y_i$ where $\mathcal{Y} := \{Y_i\}_{i=1}^{n} \in S_{\varepsilon^\beta}$ is of the sparse form, as defined in the statement of Theorem 9.
Recall Claim 11:

Claim 11 If $d_{TV}(X, Y) > \varepsilon$, then $\Delta_Y \ge \varepsilon - 2(c\varepsilon)^{-3\beta} \cdot \varepsilon^\alpha$.

Proof of Claim 11: By the definition of the total variation distance, the hypothesis implies

$$\sum_{z} |f_Y(z) - f_X(z)| > 2\varepsilon. \qquad (3)$$

We can bound the left hand side of the above as follows:



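$$\begin{aligned}
\sum_{z} |f_Y(z) - f_X(z)| &= \sum_{z \in \mathrm{supp}(Y)} |f_Y(z) - f_X(z)| + \sum_{z \notin \mathrm{supp}(Y)} |f_X(z)| \\
&\le \sum_{z \in \mathrm{supp}(Y)} |f_Y(z) - \hat f_X(z)| + \sum_{z \in \mathrm{supp}(Y)} |\hat f_X(z) - f_X(z)| + \sum_{z \notin \mathrm{supp}(Y)} |f_X(z)| \\
&\le \sum_{z \in \mathrm{supp}(Y)} |f_Y(z) - \hat f_X(z)| + 2(c\varepsilon)^{-3\beta} \cdot \varepsilon^\alpha + \sum_{z \notin \mathrm{supp}(Y)} |f_X(z)|. \qquad (4)
\end{aligned}$$

In the last line of the above we used the bound on the support of Y and the fact that for all $z \in [n]$:

$$|\hat f_X(z) - f_X(z)| = |(\hat F_X(z) - \hat F_X(z-1)) - (F_X(z) - F_X(z-1))| \le |\hat F_X(z) - F_X(z)| + |\hat F_X(z-1) - F_X(z-1)| \le 2 \cdot \varepsilon^\alpha,$$

while $|\hat f_X(0) - f_X(0)| = |\hat F_X(0) - F_X(0)| \le \varepsilon^\alpha$.
Finally, note that

$$\sum_{z \in \mathrm{supp}(Y)} |f_X(z)| = \sum_{z \in \mathrm{supp}(Y)} |\hat f_X(z) - (\hat f_X(z) - f_X(z))| \ge \sum_{z \in \mathrm{supp}(Y)} \big(|\hat f_X(z)| - |\hat f_X(z) - f_X(z)|\big) = \sum_{z \in \mathrm{supp}(Y)} |\hat f_X(z)| - \sum_{z \in \mathrm{supp}(Y)} |\hat f_X(z) - f_X(z)|.$$

Hence,

$$\begin{aligned}
\sum_{z \notin \mathrm{supp}(Y)} |f_X(z)| &= 1 - \sum_{z \in \mathrm{supp}(Y)} |f_X(z)| \\
&\le 1 - \sum_{z \in \mathrm{supp}(Y)} |\hat f_X(z)| + \sum_{z \in \mathrm{supp}(Y)} |\hat f_X(z) - f_X(z)| \\
&\le 1 - \sum_{z \in \mathrm{supp}(Y)} |\hat f_X(z)| + 2(c\varepsilon)^{-3\beta} \cdot \varepsilon^\alpha. \qquad (5)
\end{aligned}$$

Using (3), (4), (5) we obtain that $\Delta_Y \ge \varepsilon - 2(c\varepsilon)^{-3\beta} \cdot \varepsilon^\alpha$. ■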

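Recall Claim 12:

Claim 12 If $d_{TV}(X, Y) \le \varepsilon^\beta$, then $\Delta_Y \le \varepsilon^\beta + 2(c\varepsilon)^{-3\beta} \cdot \varepsilon^\alpha$.

Proof of Claim 12: By definition of the total variation distance, the hypothesis implies

$$\sum_{z} |f_Y(z) - f_X(z)| \le 2\varepsilon^\beta.$$

We can bound the left hand side of the above as follows:

$$\begin{aligned}
\sum_{z} |f_Y(z) - f_X(z)| &= \sum_{z \in \mathrm{supp}(Y)} |f_Y(z) - f_X(z)| + \sum_{z \notin \mathrm{supp}(Y)} |f_X(z)| \\
&\ge \sum_{z \in \mathrm{supp}(Y)} |f_Y(z) - \hat f_X(z)| - \sum_{z \in \mathrm{supp}(Y)} |\hat f_X(z) - f_X(z)| + \sum_{z \notin \mathrm{supp}(Y)} |f_X(z)| \\
&\ge \sum_{z \in \mathrm{supp}(Y)} |f_Y(z) - \hat f_X(z)| - 2(c\varepsilon)^{-3\beta} \cdot \varepsilon^\alpha + \sum_{z \notin \mathrm{supp}(Y)} |f_X(z)|.
\end{aligned}$$

In the last line of the above we used the same bound we used for deriving (4).
Finally, note that

$$\begin{aligned}
\sum_{z \notin \mathrm{supp}(Y)} |f_X(z)| &= 1 - \sum_{z \in \mathrm{supp}(Y)} |f_X(z)| \\
&\ge 1 - \sum_{z \in \mathrm{supp}(Y)} |\hat f_X(z)| - \sum_{z \in \mathrm{supp}(Y)} |f_X(z) - \hat f_X(z)| \\
&\ge 1 - \sum_{z \in \mathrm{supp}(Y)} |\hat f_X(z)| - 2(c\varepsilon)^{-3\beta} \cdot \varepsilon^\alpha.
\end{aligned}$$

Using the bounds above we obtain $\Delta_Y \le \varepsilon^\beta + 2(c\varepsilon)^{-3\beta} \cdot \varepsilon^\alpha$. ■

4.4 Proof of Lemma 15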

Recall Lemma 15:

Lemma 15 Let $\mathcal{X} := \{X_i\}_i$ and $\mathcal{Y} := \{Y_i\}_i$ be two collections of independent indicators in k-heavy Binomial
form and set $X = \sum_i X_i$, $Y = \sum_i Y_i$. Then

$$\tfrac{1}{2} d_K(X, Y) \le d_{TV}(X, Y) \le 2 \cdot d_K(X, Y) + O(1/k).$$

Proof of Lemma 15: The first inequality is immediate from the definition of the Kolmogorov and total
variation distances. To show the other bound, let $Z_X, Z_Y \subseteq [n]$ be the indices of the variables in the collections
$\{X_i\}_i$ and $\{Y_i\}_i$ respectively that are deterministically zero, $O_X, O_Y$ the indices of variables that are
deterministically 1, and define $E_X = [n] \setminus Z_X \setminus O_X$, $E_Y = [n] \setminus Z_Y \setminus O_Y$, $n_1 = |E_X|$, $n_2 = |E_Y|$, and $m = |O_X| - |O_Y|$.
Moreover, let $p_1$ be the common mean of the variables $X_i$, $i \in E_X$, and $p_2$ the common mean of the variables
$Y_i$, $i \in E_Y$. Without loss of generality, we can assume that $m \ge 0$. Now we define $X' = m + \mathrm{Bin}(n_1, p_1)$ and
$Y' = \mathrm{Bin}(n_2, p_2)$. It is straightforward to check that

$$d_{TV}(X, Y) = d_{TV}(X', Y')$$

and

$$d_K(X, Y) = d_K(X', Y').$$

Given that $\mathcal{X}$ and $\mathcal{Y}$ are in k-heavy Binomial form, it follows that for i = 1, 2:

$$\mu_i := n_i \cdot p_i \ge k^2 - \frac{1}{k},$$

$$\sigma_i^2 := n_i \cdot p_i (1 - p_i) \ge k^2 - k - 1 - \frac{3}{k}.$$

We recall that the translated Poisson distribution is defined as follows.

Definition 18 (Röllin (2006)) We say that an integer random variable Y has a translated Poisson distribution
with parameters $\mu$ and $\sigma^2$ and write $\mathcal{L}(Y) = TP(\mu, \sigma^2)$ if $\mathcal{L}(Y - \lfloor \mu - \sigma^2 \rfloor) = \mathrm{Poisson}(\sigma^2 + \{\mu - \sigma^2\})$, where
$\{\mu - \sigma^2\}$ represents the fractional part of $\mu - \sigma^2$.
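To make the definition concrete, here is a small sketch of the probability mass function of $TP(\mu, \sigma^2)$. The function name `translated_poisson_pmf` is an assumption, and the sketch assumes $\sigma^2 > 0$ so that the Poisson rate is positive.

```python
import math


def translated_poisson_pmf(mu, sigma2, z):
    """Pr[Y = z] for Y ~ TP(mu, sigma2), i.e. Y = floor(mu - sigma2) + Poisson(rate)
    with rate = sigma2 + frac(mu - sigma2). Assumes sigma2 > 0."""
    shift = math.floor(mu - sigma2)
    rate = sigma2 + (mu - sigma2 - shift)  # sigma2 plus the fractional part
    j = z - shift
    if j < 0:
        return 0.0  # no mass below the translation point
    # Poisson pmf evaluated in log space to avoid overflow for large j.
    return math.exp(-rate + j * math.log(rate) - math.lgamma(j + 1))
```

By construction the mean of this distribution is exactly $\mu$, while its variance lies between $\sigma^2$ and $\sigma^2 + 1$.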

Given the above, and following Daskalakis (2008) (see Section 6.1), we can show the following for
i = 1, 2:

$$d_{TV}\big(\mathrm{Bin}(n_i, p_i), TP(\mu_i, \sigma_i^2)\big) = O(1/k). \qquad (6)$$

Now we show the following:

Lemma 19 For $\lambda, \tilde\lambda > 0$ and $m, \tilde m \in \mathbb{N}_0$, let $Y = m + \mathrm{Poisson}(\lambda)$ and $\tilde Y = \tilde m + \mathrm{Poisson}(\tilde\lambda)$. Then

$$d_{TV}(Y, \tilde Y) \le 2 d_K(Y, \tilde Y).$$

Proof of Lemma 19: Without loss of generality assume that $m' := m - \tilde m \ge 0$. Then it is enough to compare
$Y' = m' + \mathrm{Poisson}(\lambda)$ and $\tilde Y' = \mathrm{Poisson}(\tilde\lambda)$, since $d_{TV}(Y, \tilde Y) = d_{TV}(Y', \tilde Y')$ and $d_K(Y, \tilde Y) = d_K(Y', \tilde Y')$. For
$i \ge 0$, define

$$R_i := \frac{\Pr[Y' = i]}{\Pr[\tilde Y' = i]}.$$

Clearly, $R_i = 0$ for $i = 0, 1, \ldots, m' - 1$, since $Y'$ is not supported on that set. On the other hand, for all $i \ge m'$
we have

$$R_i = \frac{\lambda^{i-m'} e^{-\lambda} / (i - m')!}{\tilde\lambda^{i} e^{-\tilde\lambda} / i!},$$

and for all $i \ge m' + 1$:

$$\frac{R_i}{R_{i+1}} = \frac{\tilde\lambda \cdot (i + 1 - m')}{\lambda \cdot (i + 1)}.$$

Let us distinguish the following cases:

• If $\lambda > \tilde\lambda$, then $R_i / R_{i+1} < 1$ for all i. Hence, $R_i$ is increasing in i, so that it can change from a value < 1 to a
value > 1 at most one time. Hence, there exists a single $i^*$ such that $\Pr[Y' = i] \le \Pr[\tilde Y' = i]$ for all $i \le i^*$,
and $\Pr[\tilde Y' = i] < \Pr[Y' = i]$ for all $i > i^*$. In this case, it is easy to see that

$$d_{TV}(Y', \tilde Y') = d_K(Y', \tilde Y').$$

• If $\lambda = \tilde\lambda$ and $m' = 0$, the variables $Y'$ and $\tilde Y'$ are identically distributed so that their total variation distance
and Kolmogorov distance are both identically 0. If $\lambda = \tilde\lambda$ and $m' > 0$, then $R_i / R_{i+1} < 1$ for all i and from our
argument in the previous case we obtain

$$d_{TV}(Y', \tilde Y') = d_K(Y', \tilde Y').$$

• Finally, if $\lambda < \tilde\lambda$, then $R_i / R_{i+1} < 1$ for all $i < \frac{\tilde\lambda m'}{\tilde\lambda - \lambda} - 1$ and $R_i / R_{i+1} > 1$ for all $i > \frac{\tilde\lambda m'}{\tilde\lambda - \lambda} - 1$. So $R_i$ is increasing
up to some $i^*$ and decreasing above $i^*$. So the distributions of $Y'$ and $\tilde Y'$ have at most two intersections.
Hence, we obtain

$$d_{TV}(Y', \tilde Y') \le 2 d_K(Y', \tilde Y').$$

■ 
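The inequality of Lemma 19 can be checked numerically on small instances. The sketch below is only a sanity check under assumptions made here: distributions are truncated to a finite range (so a negligible amount of tail mass is dropped), total variation distance is taken as half the $\ell_1$ distance, and the function names are hypothetical.

```python
import math


def shifted_poisson_pmf(m, lam, size):
    """pmf of m + Poisson(lam) restricted to {0, ..., size - 1}."""
    pmf = [0.0] * size
    for j in range(size - m):
        pmf[m + j] = math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))
    return pmf


def tv_and_kolmogorov(p, q):
    """Return (total variation distance, Kolmogorov distance) of two pmfs
    given as equal-length lists."""
    tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
    dk, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        dk = max(dk, abs(cdf_p - cdf_q))
    return tv, dk


# Example: compare 2 + Poisson(3) with Poisson(5) on {0, ..., 199}.
tv, dk = tv_and_kolmogorov(shifted_poisson_pmf(2, 3.0, 200),
                           shifted_poisson_pmf(0, 5.0, 200))
```

A numerical check of this kind does not replace the proof; it only illustrates the single-crossing and double-crossing behaviour that the case analysis relies on.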

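Given the above we have

$$\begin{aligned}
d_{TV}(X', Y') &= d_{TV}\big(m + \mathrm{Bin}(n_1, p_1), \mathrm{Bin}(n_2, p_2)\big) \\
&\le d_{TV}\big(m + \mathrm{Bin}(n_1, p_1), m + TP(\mu_1, \sigma_1^2)\big) + d_{TV}\big(m + TP(\mu_1, \sigma_1^2), TP(\mu_2, \sigma_2^2)\big) \\
&\quad + d_{TV}\big(\mathrm{Bin}(n_2, p_2), TP(\mu_2, \sigma_2^2)\big) && \text{(using the triangle inequality)} \\
&\le O(1/k) + d_{TV}\big(m + TP(\mu_1, \sigma_1^2), TP(\mu_2, \sigma_2^2)\big) && \text{(using (6))} \\
&\le O(1/k) + 2 d_K\big(m + TP(\mu_1, \sigma_1^2), TP(\mu_2, \sigma_2^2)\big) && \text{(using Lemma 19)} \\
&\le O(1/k) + 2 d_K\big(m + \mathrm{Bin}(n_1, p_1), m + TP(\mu_1, \sigma_1^2)\big) + 2 d_K\big(\mathrm{Bin}(n_2, p_2), TP(\mu_2, \sigma_2^2)\big) \\
&\quad + 2 d_K\big(m + \mathrm{Bin}(n_1, p_1), \mathrm{Bin}(n_2, p_2)\big) && \text{(using the triangle inequality)} \\
&\le O(1/k) + 2 d_K\big(m + \mathrm{Bin}(n_1, p_1), \mathrm{Bin}(n_2, p_2)\big) && \text{(using that } d_K \le 2 \cdot d_{TV} \text{ and (6))} \\
&= O(1/k) + 2 d_K(X', Y').
\end{aligned}$$

This concludes the proof of Lemma 15. ■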

5 An intermediate case: Linear transformation functions with 0(1) distinct weights 

Recall that Theorem 2 shows that the exponential running time of Theorem 6 cannot be significantly improved
even if the transformation function f is a degree-2 polynomial, and Theorem 3 shows that the $\tilde O(n)$ sample
complexity cannot be significantly improved even if f is a simple linear function. In contrast with these
strong negative results, we have also seen that the sum-of-Bernoullis transformation function $f(x) = \sum_{i=1}^{n} x_i$
admits highly efficient algorithms both in terms of running time and sample complexity. We close this paper
by showing that in an intermediate case, in which the transformation function f is a linear function with constantly
many different weights, it is also possible to improve on the generic time and sample complexity bounds
of Theorem 6, though not quite as dramatically as for sums of Bernoullis:

Theorem 20 (Linear transformation functions with O(1) different weights) Let $f(x) = \sum_{i=1}^{n} a_i x_i$ be any
function such that there are at most k different values in the set $\{a_1, \ldots, a_n\}$. Then there is an algorithm that
uses $k \log(n) \cdot \tilde O(\varepsilon^{-2})$ samples from the target distribution $f(p)$, runs in time $\mathrm{poly}(n^k \cdot \varepsilon^{-k \log^2(1/\varepsilon)})$, and with
probability at least 9/10 outputs a hypothesis vector $\hat p$ such that $d_{TV}(f(\hat p), f(p)) \le \varepsilon$.

Note that setting $a_1 = \cdots = a_n = 1$ in Theorem 20 gives a weaker result than Theorem 4, since the
resulting sample complexity is $\log(n) \cdot \tilde O(\varepsilon^{-2})$, whereas Theorem 4 gives a poly(1/ε) sample complexity
bound independent of n.

Proof of Theorem 20: We claim that the algorithm of Theorem 6 has the desired sample complexity and can
be implemented to run in polynomial time.

Let $\{b_j\}_{j=1}^{k}$ denote the set of distinct weights and $n_j = |\{i \in [n] \mid a_i = b_j\}|$, where k = O(1). With this
notation, we can write $f(X) = \sum_{j=1}^{k} b_j S_j = g(S)$, where $S = (S_1, \ldots, S_k)$ with each $S_j$ a sum of $n_j$ many
independent Bernoulli random variables and $g(y_1, \ldots, y_k) = \sum_{j=1}^{k} b_j y_j$. Clearly, $\sum_{j=1}^{k} n_j = n$. By Theorem 9,
$S_j$ has an explicit ε-cover $\mathcal{S}_j$ of size $|\mathcal{S}_j| \le n_j^3 \cdot O(1/\varepsilon) + n \cdot (1/\varepsilon)^{O(\log^2 1/\varepsilon)}$. By independence across the $S_j$'s, the
product $\mathcal{Q} = \prod_{j=1}^{k} \mathcal{S}_j$ is an ε-cover for S, hence also for f(X). That is, f(X) has an explicit ε-cover of size
$|\mathcal{Q}| = \prod_{j=1}^{k} |\mathcal{S}_j| \le (n/k)^{3k} \cdot (1/\varepsilon)^{k \cdot O(\log^2 1/\varepsilon)}$. The sample complexity bound follows directly.

It remains to argue about the time complexity. Note that the running time of the algorithm is $O(|\mathcal{Q}|^2)$ times
the running time of a competition. We will show that a competition between $Q_1, Q_2 \in \mathcal{Q}$ can be efficiently
computed. This amounts to efficiently computing the probabilities $p_1 = f(Q_1)(\mathcal{W}_1)$ and $q_1 = f(Q_2)(\mathcal{W}_1)$.
Note that $\mathcal{W} = f(\{0, 1\}^n) = \sum_{j=1}^{k} b_j \cdot \{0, 1, \ldots, n_j\}$. Clearly, $|\mathcal{W}| \le \prod_{j=1}^{k} (n_j + 1) = O((n/k)^k)$. It is thus
easy to see that $p_1, q_1$ can be efficiently computed as long as there is an efficient algorithm for the following
problem: given $Q \in \mathcal{Q}$ and $w \in \mathcal{W}$, compute $f(Q)(w)$. Indeed, fix any such Q, w. We have that
$f(Q)(w) = \sum_{m_1, \ldots, m_k} \prod_{j=1}^{k} \Pr_{X \sim Q}[S_j = m_j]$, where the sum is over all k-tuples $(m_1, \ldots, m_k)$ such that $0 \le m_j \le n_j$ for
all j and $b_1 m_1 + \cdots + b_k m_k = w$ (as noted above there are at most $O((n/k)^k)$ such k-tuples). To complete
the proof of Theorem 20 we note that $\Pr_{X \sim Q}[S_j = m_j]$ can be computed in $O(n_j)$ time by standard dynamic
programming. ■
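The computation of $f(Q)(w)$ described in the proof can be sketched as follows. Note that this sketch swaps in a group-by-group convolution instead of the explicit enumeration over k-tuples used in the proof; it yields the same distribution. The function names and the list-of-lists input format are assumptions made for illustration.

```python
from collections import defaultdict


def bernoulli_sum_pmf(probs):
    """pmf[j] = Pr[S = j] for S a sum of independent Bernoulli(probs[i])."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for j, mass in enumerate(pmf):
            nxt[j] += mass * (1.0 - p)
            nxt[j + 1] += mass * p
        pmf = nxt
    return pmf


def weighted_sum_pmf(weights, groups):
    """Distribution of f(X) = sum_j weights[j] * S_j as a dict value -> probability,
    where S_j is the sum of independent Bernoullis with means groups[j]."""
    dist = {0: 1.0}
    for b, probs in zip(weights, groups):
        s_pmf = bernoulli_sum_pmf(probs)
        nxt = defaultdict(float)
        for value, mass in dist.items():
            for m_j, pm in enumerate(s_pmf):
                nxt[value + b * m_j] += mass * pm  # S_j takes the value m_j
        dist = dict(nxt)
    return dist
```

For example, `weighted_sum_pmf([1, 2], [[0.5], [0.5]])` describes $X_1 + 2X_2$ for two fair bits, which is uniform on {0, 1, 2, 3}.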

6 Conclusion and open problems 

We feel that the transformed product distribution learning model offers a rich field for further study, with 
many natural directions to explore. We close this paper with some specific questions and suggestions for 
future work. 

Optimally learning sums of Bernoullis? An obvious question is whether the competing advantages of
Theorems 4 and 5 can be simultaneously achieved by a single algorithm: is there an algorithm to learn sums
of Bernoullis that uses poly(1/ε) samples and runs in poly(log n, 1/ε) time?

Learning weighted sums of Bernoullis? In Section 5 we observed that a poly(n)-time algorithm exists for
any linear transformation function $f(x_1, \ldots, x_n) = \sum_{i=1}^{n} a_i x_i$ for which there are only O(1) many different $a_i$'s.
Can a poly(n)-time algorithm be obtained for every linear transformation function $f(x) = \sum_{i=1}^{n} a_i x_i$, where
$(a_1, \ldots, a_n)$ is an arbitrary vector in $\mathbb{R}^n$? What if each $a_i$ is a positive integer that is at most poly(n)?

Learning when the transformation function is in $NC^0$? Suppose that the transformation function f maps
$\{0, 1\}^n$ to $\{0, 1\}^n$, i.e. $f = (f_1, \ldots, f_n)$, where each $f_i : \{0, 1\}^n \to \{0, 1\}$ is a k-junta (a function that depends
only on k of the n input variables) for some constant k. Is the corresponding transformed product distribution
learning problem solvable in poly(n, 1/ε) time? We conjecture that the answer is yes. (Note that as suggested
by the second example of the Introduction, an algorithm for a special case of this question (in which each $f_i$ is
a particular (2k - 1)-junta) yields a poly(n/ε)-time algorithm for learning a mixture of k product distributions
over $\{0, 1\}^n$. This mixture learning problem is indeed known to be solvable in poly(n/ε) time for constant k,
but the algorithm is somewhat involved; see Feldman et al. (2008).)

When do O(1) samples information-theoretically suffice? Our main result in Section 4 shows that for
the transformation function $f(x) = \sum_{i=1}^{n} x_i$, the sample complexity required for learning to accuracy ε is
poly(1/ε), independent of n. But as we show in Appendix D, the seemingly similar linear transformation
function $f(x) = \sum_{i=n/2+1}^{n} i \cdot x_i$ requires $\Omega(n)$ samples, which is close to the worst possible for any f. This
disparity motivates the following question: what necessary or sufficient conditions can be given on a function
$f : \{0, 1\}^n \to \mathbb{R}$ that cause the corresponding learning problem to have sample complexity depending only on
ε (independent of n)? More ambitiously, is there a quantitative measure of the "complexity" of a function f
that gives a tight quantitative bound on the sample complexity of the product distribution learning problem
for f? The Vapnik-Chervonenkis dimension of a concept class C plays such a role in the PAC learning model,
since it tightly characterizes the number of examples that are required to solve the learning problem for C. Is
there an analogous measure of the "complexity" of a transformation f for our product distribution learning
problem?

A Proof of Observation 1.1: AND-gate functions

Let $p = (p_1, \ldots, p_n)$ be the unknown target vector of probabilities. For each $i \in [m]$ let $P_i$ denote the true
probability $\Pr_X[f_i(X) = 1]$ where the probability is over X drawn from the product distribution p. Using
poly(n, 1/ε) random samples of f(X) it is straightforward to obtain upper and lower bounds $0 \le P_{i,-} \le P_{i,+} \le 1$
such that $P_{i,+} - P_{i,-} \le \varepsilon/m$ and with probability at least 9/10, every $i \in [m]$ has $P_{i,-} \le P_i \le P_{i,+}$.
For each $i \in [m]$ we have that the function $f_i(x)$ is $\bigwedge_{j \in S_i} x_j$; by independence we have
$P_i = \prod_{j \in S_i} p_j$ and thus $\log P_i = \sum_{j \in S_i} \log p_j$.

Using the bounds $P_{i,-}$ and $P_{i,+}$ it is straightforward to set up a system of linear inequalities in variables
$q_1, \ldots, q_n$ where each $q_j$ plays the role of $\log p_j$, i.e. for a given i we have the inequalities

$$\log P_{i,-} \le \sum_{j \in S_i} q_j \le \log P_{i,+}.$$

(We also include the inequalities $q_j \le 0$ for each j since the $p_j$'s must be probabilities, i.e. values at most
1.) With probability at least 9/10 the system is feasible (since setting each $q_j$ to be $\log p_j$ gives a feasible
solution), so we can use polynomial-time linear programming to obtain a feasible solution $\hat q_1, \ldots, \hat q_n$. The
corresponding product distribution $\hat p = (\hat p_1, \ldots, \hat p_n)$ where $\hat p_j = 2^{\hat q_j}$ has the property that for each i, we have
$|\Pr_{X \sim \hat p}[f_i(X) = 1] - P_i| \le \varepsilon/m$. A simple argument (using independence between the different $f_i(X)$'s, which
holds since the sets $S_i$ are pairwise disjoint) then shows that the total variation distance $d_{TV}(f(\hat p), f(p))$ is at
most ε, and Observation 1.1 is proved. ■

B Proof of Theorem 6: A tournament between distributions in a cover

Recall Theorem 6:

Theorem 6 Fix a function $f : \{0, 1\}^n \to \Omega$ where $\Omega$ is any range set. Suppose there exists a δ-cover for $f(S)$
of size $N = N(n, \delta)$. Then there is an algorithm that uses $O(\delta^{-2} \log N)$ samples and solves the f-transformed
product distribution learning problem to accuracy 6δ.

Proof: Let $P \in S$ be the input distribution fed to the circuit f. We will describe an algorithm that, given
$m = O(\delta^{-2} \log N)$ independent samples $s = \{s_i\}_{i=1}^{m}$ from $f(P)$, finds a distribution $Q^* \in S$ that satisfies
$d_{TV}(f(P), f(Q^*)) \le 6\delta$ with probability at least 9/10.

Recall that the high-level idea of the proof is as follows. For a pair of distributions $Q_1, Q_2 \in S$, we will
define a competition between $Q_1$ and $Q_2$ that takes as input the sample s and either crowns one of $Q_1, Q_2$ as
the winner of the competition or calls the competition a draw. Let $\mathcal{Q} \subseteq S$ be a δ-cover for $f(S)$ of cardinality
$N = N(n, \delta)$. The algorithm performs a tournament between every pair of distributions from $\mathcal{Q}$ and outputs a
distribution $Q^* \in \mathcal{Q}$ that was never a loser (i.e. won or was a draw in all competitions). If no such distribution
exists, the algorithm outputs "failure."

To describe the competition procedure between two distributions $Q_1, Q_2 \in S$, we define the following
partition of the range space $\mathcal{W} = f(\{0, 1\}^n) \subseteq \Omega$:

$$\mathcal{W}_1 := \{w \in \mathcal{W} \mid f(Q_1)(w) > f(Q_2)(w)\}, \qquad \mathcal{W}_2 := \mathcal{W} \setminus \mathcal{W}_1.$$

Let $p_1 = f(Q_1)(\mathcal{W}_1)$ and $q_1 = f(Q_2)(\mathcal{W}_1)$, and define $p_2 = 1 - p_1$ and $q_2 = 1 - q_1$. Clearly, $p_1 > q_1$ and
$p_2 < q_2$. Moreover, $d_{TV}(f(Q_1), f(Q_2)) = p_1 - q_1$. Finally, let $T(s) = \frac{1}{m} |\{i \mid s_i \in \mathcal{W}_1\}|$ be the fraction of
samples falling in the set $\mathcal{W}_1$. The outcome of the competition between $Q_1$ and $Q_2$ is decided as follows:

• If $p_1 - q_1 \le 5\delta$, return "draw";

• else if $T(s) > p_1 - \frac{3}{2}\delta$, return $Q_1$;

• else if $T(s) < q_1 + \frac{3}{2}\delta$, return $Q_2$;

• else return "draw".

Observe that the outcome of the competition does not depend on the ordering of the pair of distributions given
in the input; i.e. on inputs $(Q_1, Q_2)$ and $(Q_2, Q_1)$ the competition outputs the same result for a fixed sequence
of samples $s_1, \ldots, s_m$.
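The competition and the tournament built on it can be sketched as below, assuming the thresholds $5\delta$ and $\frac{3}{2}\delta$ as described above. The representation of each hypothesis by the pmf of $f(Q)$ (a dict from outcomes to probabilities) and the function names are assumptions made for this illustration.

```python
def competition(pmf1, pmf2, samples, delta):
    """Return 1 if the first hypothesis wins, 2 if the second wins, 0 for a draw.

    pmf1 and pmf2 map each outcome w to its probability under f(Q1) and f(Q2).
    """
    outcomes = set(pmf1) | set(pmf2)
    # W_1: outcomes to which the first hypothesis assigns strictly more mass.
    w1 = {w for w in outcomes if pmf1.get(w, 0.0) > pmf2.get(w, 0.0)}
    p1 = sum(pmf1.get(w, 0.0) for w in w1)
    q1 = sum(pmf2.get(w, 0.0) for w in w1)
    if p1 - q1 <= 5 * delta:
        return 0  # hypotheses too close to separate
    t = sum(1 for s in samples if s in w1) / len(samples)
    if t > p1 - 1.5 * delta:
        return 1
    if t < q1 + 1.5 * delta:
        return 2
    return 0


def tournament(cover, samples, delta):
    """Return the index of some hypothesis that never lost, or None ("failure")."""
    lost = [False] * len(cover)
    for i in range(len(cover)):
        for j in range(i + 1, len(cover)):
            result = competition(cover[i], cover[j], samples, delta)
            if result == 1:
                lost[j] = True
            elif result == 2:
                lost[i] = True
    for i, has_lost in enumerate(lost):
        if not has_lost:
            return i
    return None
```

The tournament runs one competition per pair, so its cost is quadratic in the size of the cover times the cost of evaluating $p_1$ and $q_1$.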

We now prove correctness. Our first lemma quantifies the following intuitive fact: if $f(Q_1)$ is "close" to
$f(P)$ while $f(Q_2)$ is "far" from $f(P)$, then with very high probability over the sample $Q_1$ will be the winner
of the competition.

Lemma 21 Let $Q_1 \in S$ be such that $d_{TV}(f(P), f(Q_1)) \le \delta$.

(i) If $Q_2 \in S$ is such that $d_{TV}(f(P), f(Q_2)) > 6\delta$, the probability that the competition between $Q_1$ and $Q_2$
does not return $Q_1$ as the winner is at most $e^{-m\delta^2/2}$.

(ii) If $Q_2 \in S$ is such that $d_{TV}(f(P), f(Q_2)) > 4\delta$, the probability that the competition between $Q_1$ and $Q_2$
returns $Q_2$ as the winner is at most $e^{-m\delta^2/2}$.

Proof: Let r denote $f(P)(\mathcal{W}_1)$. The definition of the total variation distance implies that $|r - p_1| \le \delta$. Let
us define the 0/1 (indicator) random variables $\{Z_i\}_{i=1}^{m}$ as $Z_i = 1$ iff $s_i \in \mathcal{W}_1$. Clearly, $T(s) = \frac{1}{m} \sum_{i=1}^{m} Z_i$
and $E[T] = E[Z_i] = r$. Since the $Z_i$'s are mutually independent, it follows from the Chernoff bound that
$\Pr[T \le r - \delta/2] \le e^{-m\delta^2/2}$. Using $|r - p_1| \le \delta$ we get that $\Pr[T \le p_1 - 3\delta/2] \le e^{-m\delta^2/2}$.

We prove part (i) first. Since $d_{TV}(f(P), f(Q_2)) > 6\delta$, the triangle inequality implies that $p_1 - q_1 =
d_{TV}(f(Q_1), f(Q_2)) > 5\delta$. Hence, with probability at least $1 - e^{-m\delta^2/2}$ the competition between $Q_1$ and $Q_2$ will
output $Q_1$ as the winner.

Now we prove part (ii). If $p_1 - q_1 \le 5\delta$ then the competition returns "draw" and $Q_2$ is not the winner. If
$p_1 - q_1 > 5\delta$ then the above implies that the competition between $Q_1$ and $Q_2$ will output $Q_2$ as the winner
with probability at most $e^{-m\delta^2/2}$. ■

Our second lemma completes the proof of Theorem 6.

Lemma 22 If $m = \Omega(\delta^{-2} \log N)$, then with probability at least 9/10 the tournament outputs some distribution
$Q^* \in \mathcal{Q}$ such that $d_{TV}(f(Q^*), f(P)) \le 6\delta$.

Proof: Since $\mathcal{Q}$ is a δ-cover of $f(S)$, there exists $Q \in \mathcal{Q}$ such that $d_{TV}(f(Q), f(P)) \le \delta$. We first argue
that with high probability this distribution Q never loses a competition against any other $Q' \in \mathcal{Q}$ (so the
tournament does not output "failure"). Consider any $Q' \in \mathcal{Q}$. If $d_{TV}(f(P), f(Q')) > 4\delta$, by Lemma 21(ii) the
probability that Q loses to Q' is at most $e^{-m\delta^2/2} = O(1/N)$. On the other hand, if $d_{TV}(f(P), f(Q')) \le 4\delta$, the
triangle inequality gives that $d_{TV}(f(Q), f(Q')) \le 5\delta$ and thus Q draws against Q'. A union bound over all
distributions in $\mathcal{Q}$ shows that with probability 19/20, the distribution Q never loses a competition.

We next argue that with probability at least 19/20, every distribution $Q' \in \mathcal{Q}$ that never loses has $f(Q')$
close to $f(P)$. Fix a distribution Q' such that $d_{TV}(f(Q'), f(P)) > 6\delta$; Lemma 21(i) implies that Q' loses to Q
with probability $1 - e^{-m\delta^2/2} = 1 - O(1/N)$. A union bound gives that with probability 19/20, every distribution
Q' that has $d_{TV}(f(Q'), f(P)) > 6\delta$ loses some competition.

Thus, with overall probability at least 9/10, the tournament does not output "failure" and outputs some
distribution $Q^*$ such that $d_{TV}(f(P), f(Q^*))$ is at most 6δ. This proves Lemma 22 and Theorem 6. ■



C Learning transformed product distributions can be computationally hard 

Recall Theorem 2:

Theorem 2 Suppose $NP \not\subseteq BPP$. Then there is an explicit degree-2 polynomial f (given in Equation (7) below)
such that there is no polynomial-time algorithm that solves the transformed product distribution learning
problem for f to accuracy ε = 1/3.

Proof: The function f is quite simple. It takes $m = n^2 + n$ bits of input

$$(w, s) = (w_{1,1}, \ldots, w_{1,n}, w_{2,1}, \ldots, w_{2,n}, \ldots, w_{n,1}, \ldots, w_{n,n}, s_1, \ldots, s_n).$$

Here we think of each n-bit substring $w_{i,1} \ldots w_{i,n} \in \{0, 1\}^n$ as the binary representation of the number $W_i =
\sum_{j=1}^{n} 2^{n-j} w_{i,j} \in \{0, 1, \ldots, 2^n - 1\}$. We think of $s_1, \ldots, s_n$ as representing a subset $S \subseteq [n]$, where $i \in S$ if and
only if $s_i = 1$. The function $f(w, s)$ is defined to be

f(w, s)^i 2''"lV„+i-,- + i Wi(2si - 1) (7) 

1=1 i=l 

which is easily seen to be a degree-2 polynomial. 

Recall that an input to the NP-complete PARTITION problem is a list of n numbers Wi, . . . ,W„. The input 
is a yes-instance if there is a set 5 c [«] such that Y^ies - Has ( equivalently, if there is a bitstring 
(si,..., s„) such that Wi(2si - 1) = 0). 

It is easy to see from the definition of / that the output number f(w, s) can be viewed (reading from most 
significant bit to least significant bit) as specifying 

• the n numbers Wi, . . . ,W„; and 

• the value Yj1=i Wi(2si - 1) of the candidate solution S c [«]. Note that this value is if and only if S is 
a legitimate solution to the PARTITION instance specified by (W\, . . . , W„). 

So we may view the input to / as the tuple (Wi, . . . , W„,S) and its corresponding output as the tuple 
(Wi, . . ., W„, u) where u - Wi(2si - 1) e N is the value of the candidate solution S . 

The proof is by contradiction, so let us suppose that Lisa learning algorithm that runs in polynomial time 
and learns to accuracy e - 1/3. For any "target distribution" p e [0, 1]"', if L is given access to q(n) = poly(n) 
many independent draws from fCp), then with probability 9/10 algorithm L outputs a vector p = (pi,. . ., p,„) 
of probabilities s.t. the total variation distance dTv(f('p),f(p)) is at most 1 /3. 
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We now explain how L yields a randomized poly(n)-time algorithm A to solve PARTITION. Given a 
PARTITION instance {W\, . . ., W„) as input, algorithm A runs L on a data set consisting of q{n) copies of 
the tuple (Wi , . . . , W;,, 0). If L fails to return a vector p then A outputs "no" (meaning that the PARTITION 
instance has no solution). If L returns a vector p then: 

• $A$ checks that $\prod_{i=1}^{n^2} \max\{\hat{p}_i, 1 - \hat{p}_i\} > 2/3$; if not it returns "no."

• If $A$ reaches this step, for each $i \in [n^2]$ let $b_i$ be the result of rounding $\hat{p}_i$ to the nearest integer (0 or 1) and let $(W'_1, \ldots, W'_n)$ be the PARTITION instance represented by the string $(b_1, \ldots, b_{n^2})$. $A$ checks that $(W'_1, \ldots, W'_n)$ is identical to $(W_1, \ldots, W_n)$; if not it outputs "no."

• If $A$ reaches this step, then $A$ draws 100 random $n$-bit strings $x^1, \ldots, x^{100}$ independently from the product distribution $(\hat{p}_{n^2+1}, \ldots, \hat{p}_{n^2+n})$ and evaluates $\sum_{i=1}^{n} W_i(2x_i - 1)$ on each of them. If any evaluation yields 0 then $A$ outputs "yes" and otherwise it outputs "no."
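The three checks that $A$ performs on the learner's output can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, and the encoding of the instance (each $W_i$ written as an $n$-bit block, most significant bit first, so that the instance occupies the first $n^2$ coordinates) is an assumption made for the example rather than something fixed by the text above.

```python
import random

def decode_instance(bits, n):
    # Assumed encoding: n consecutive n-bit blocks, most significant bit first.
    return [int("".join(map(str, bits[i * n:(i + 1) * n])), 2) for i in range(n)]

def partition_value(weights, x):
    # Value sum_i W_i (2 x_i - 1) of the candidate solution x; 0 means "legitimate".
    return sum(w * (2 * xi - 1) for w, xi in zip(weights, x))

def post_process(weights, p_hat, trials=100, rng=None):
    rng = rng or random.Random(0)
    n = len(weights)
    head, tail = p_hat[:n * n], p_hat[n * n:n * n + n]
    # Check 1: the first n^2 coordinates must be jointly "confident".
    confidence = 1.0
    for p in head:
        confidence *= max(p, 1.0 - p)
    if confidence <= 2.0 / 3.0:
        return "no"
    # Check 2: the rounded coordinates must decode to the input instance.
    bits = [1 if p >= 0.5 else 0 for p in head]
    if decode_instance(bits, n) != list(weights):
        return "no"
    # Check 3: sample candidate solutions from the last n coordinates.
    for _ in range(trials):
        x = [1 if rng.random() < p else 0 for p in tail]
        if partition_value(weights, x) == 0:
            return "yes"
    return "no"
```

For example, with weights $(1, 2, 3)$ and a hypothesis whose last three coordinates are $(1, 1, 0)$, every sampled string has value $1 + 2 - 3 = 0$ and the sketch answers "yes".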

We now prove correctness. Suppose first that $(W_1, \ldots, W_n)$ is a yes-instance of PARTITION. With probability 1 the data set consisting of $q(n)$ copies of $(W_1, \ldots, W_n, 0)$ is identical to the outcome of $q(n)$ draws from the distribution $f(p^*)$, where each coordinate of $p^*$ is either 0 or 1, the first $n^2$ coordinates $p^*_1, \ldots, p^*_{n^2}$ encode the numbers $(W_1, \ldots, W_n)$, and the last $n$ coordinates encode a legitimate solution $S$ for $(W_1, \ldots, W_n)$. Thus, with probability at least 9/10 the algorithm $L$ outputs a vector $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_m)$ s.t. $d_{TV}(f(\hat{p}), f(p^*)) \le 1/3$. Since $f(p^*)$ puts all its weight on one output string $(W_1, \ldots, W_n, 0)$, it must indeed be the case that $\prod_{i=1}^{n^2} \max\{\hat{p}_i, 1 - \hat{p}_i\} > 2/3$, and it is easy to see that the string $b = (b_1, \ldots, b_{n^2})$ defined in the second bullet will be identical to $(W_1, \ldots, W_n)$. Thus $(W'_1, \ldots, W'_n)$ is identical to $(W_1, \ldots, W_n)$ and $A$ makes it through the second bullet. Finally, since $d_{TV}(f(\hat{p}), f(p^*)) \le 1/3$, in expectation at least 2/3 of the 100 strings independently drawn from $(\hat{p}_{n^2+1}, \ldots, \hat{p}_m)$ should be legitimate solutions, and the probability that none of $x^1, \ldots, x^{100}$ is a legitimate solution is at most 0.001. Thus, if $(W_1, \ldots, W_n)$ is a yes-instance of PARTITION, the overall probability that $A$ outputs "yes" is at least 0.89.

Now suppose that $(W_1, \ldots, W_n)$ is a no-instance of PARTITION, but that $A$ outputs "yes" on $(W_1, \ldots, W_n)$ with probability at least 0.1. This means that with probability at least 0.1, $L$ outputs a vector $\hat{p}$ such that $\prod_{i=1}^{n^2} \max\{\hat{p}_i, 1 - \hat{p}_i\} > 2/3$, which can be uniquely decoded into a PARTITION instance $(W'_1, \ldots, W'_n)$ which must equal $(W_1, \ldots, W_n)$. The PARTITION instance $(W'_1, \ldots, W'_n) = (W_1, \ldots, W_n)$ must be satisfied with probability at least 1/1000 by a random string drawn from the probability distribution $(\hat{p}_{n^2+1}, \ldots, \hat{p}_m)$ (for otherwise the probability of a "yes" output would be less than 1/10). But this violates the assumption that $(W_1, \ldots, W_n)$ is a no-instance. So if $(W_1, \ldots, W_n)$ is a no-instance of PARTITION, then it must be the case that $A$ outputs "no" on $(W_1, \ldots, W_n)$ with probability at least 0.9.

Thus we have shown that $A$ is a BPP algorithm for the PARTITION problem, and Theorem 2 is proved.



D Proof of Theorem 3: $f(x) = \sum_{i=k/2+1}^{k} i \cdot x_i$ requires $\Omega(k)$ samples

We define a probability distribution over problem instances (i.e. target probability vectors $p$) as follows: A subset $S \subset \{k/2 + 1, \ldots, k\}$ of size $|S| = k/100$ is drawn uniformly at random, i.e. each of the $\binom{k/2}{k/100}$ outcomes for $S$ is equally likely. For each $i \in S$ the value $p_i$ equals $100/k = 1/|S|$, and for all other $i$ the value $p_i$ equals 0. We will need two easy lemmas:

Lemma 23 Fix any $S, p$ as described above. For any $j \in \{k/2 + 1, \ldots, k\}$ we have $f(p)(j) \ne 0$ if and only if $j \in S$. For any $j \in S$ the value $f(p)(j)$ is exactly $(100/k)(1 - 100/k)^{k/100 - 1} > 35/k$ (for $k$ sufficiently large), and similarly $f(p)(\{k/2 + 1, \ldots, k\}) > 0.35$ (again for $k$ sufficiently large).

The first claim of the lemma holds because any set of $c \ge 2$ numbers from $\{k/2 + 1, \ldots, k\}$ must sum to more than $k$. The second claim holds because the only way a draw of $X$ from $p$ can have $f(X) = j$ is if $X_j = 1$ and all other $X_i$ are 0 (and uses $\lim_{x \to \infty}(1 - 1/x)^x = 1/e$).
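The hard instance and the point mass from Lemma 23 are easy to check numerically. The following Python sketch (function names are hypothetical, and $k$ is assumed to be a multiple of 100) draws a random $S$, builds the corresponding vector $p$, and evaluates the closed form $(100/k)(1 - 100/k)^{k/100 - 1}$.

```python
import random

def sample_instance(k, rng):
    # Draw S uniformly among the (k/100)-element subsets of {k/2 + 1, ..., k}.
    S = set(rng.sample(range(k // 2 + 1, k + 1), k // 100))
    # p_i = 100/k on S and 0 elsewhere (coordinates indexed 1..k).
    p = {i: (100.0 / k if i in S else 0.0) for i in range(1, k + 1)}
    return S, p

def point_mass(k):
    # f(p)(j) for j in S: X_j = 1 and the other k/100 - 1 coordinates of S are 0.
    return (100.0 / k) * (1.0 - 100.0 / k) ** (k // 100 - 1)

if __name__ == "__main__":
    k = 10_000
    S, p = sample_instance(k, random.Random(0))
    print(len(S), sum(p.values()))           # size of S and total of the p_i
    print(point_mass(k) * k)                 # should exceed 35
    print(point_mass(k) * (k // 100))        # mass on {k/2 + 1, ..., k}, should exceed 0.35
```

Because $(1 - 1/x)^{x-1}$ decreases to $1/e \approx 0.368$, the margin over 0.35 is small but positive.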

The next lemma is an easy consequence of Chernoff bounds:

Lemma 24 Fix any $p$ as defined above, and consider a sequence of $k/2000$ independent draws of $X$ from $p$. With probability $1 - e^{-\Omega(k)}$ the total number of indices $j \in [k]$ such that $X_j = 1$ in any of the $k/2000$ draws is at most $k/1000$.
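A quick simulation illustrates Lemma 24. In the sketch below (a hypothetical helper, not part of the proof) each of the $k/2000$ draws sets each coordinate of $S$ to 1 independently with probability $100/k$, so the expected total number of ones is $k/2000$, half of the $k/1000$ threshold.

```python
import random

def distinct_ones(k, rng):
    # Sample the hidden set S, then count the distinct indices j that are
    # set to 1 in at least one of the k/2000 draws of X from p.
    S = rng.sample(range(k // 2 + 1, k + 1), k // 100)
    prob = 100.0 / k
    seen = set()
    for _ in range(k // 2000):
        for j in S:
            if rng.random() < prob:
                seen.add(j)
    return len(seen)

if __name__ == "__main__":
    k = 200_000
    print(distinct_ones(k, random.Random(1)), "vs threshold", k // 1000)
```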

Proof of Theorem 3: Let $L$ be a learning algorithm that receives $k/2000$ samples. Let $S \subset \{k/2 + 1, \ldots, k\}$ and the target distribution $p$ be chosen randomly as defined above.

We consider an augmented learner $L'$ that is given "extra information." For each point $f(X)$ in the sample, instead of receiving $f(X) = \sum_{i=k/2+1}^{k} i \cdot X_i$, learner $L'$ is given the entire vector $(X_1, \ldots, X_k) \in \{0, 1\}^k$. Let $T$ denote the set of elements $j \in \{k/2 + 1, \ldots, k\}$ for which the learner is given some vector $X$ that has $X_j = 1$. By Lemma 24 we have $|T| \le k/1000$ with probability at least $1 - e^{-\Omega(k)}$; we condition on the event $|T| \le k/1000$ going forth.

Fix any value $\ell \le k/1000$. Conditioned on $|T| = \ell$, the set $T$ is equally likely to be any $\ell$-element subset of $S$, and all possible "completions" of $T$ with an additional $k/100 - \ell \ge 9k/1000$ elements of $\{k/2 + 1, \ldots, k\} \setminus T$ are equally likely to be the true set $S$.

Let $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_k)$ denote the hypothesis vector of probabilities that $L'$ outputs. Let $R$ denote the set $\{k/2 + 1, \ldots, k\} \setminus T$; note that since $|T| = \ell \le k/1000$, we have $|R| \ge 499k/1000$. We consider two possible cases for $\hat{p}$ and show that in either case the learner's hypothesis distribution $f(\hat{p})$ has high error (in the first case because of outcomes in $R$, in the second case because of the outcome 0) with high probability.

Case 1: Fewer than $k/4$ of the (at least $499k/1000$) elements $i \in R$ have $\hat{p}_i \ge 10/k$. Let $U$ be the set of those (fewer than $k/4$) elements. Since every $(k/100 - \ell)$-element subset of $R$ is equally likely to be the correct completion $S \setminus T$, and as observed above $k/100 - \ell$ is at least $9k/1000$, an elementary argument shows that with probability $1 - e^{-\Omega(k)}$ we have that $U$ contains at most $8k/1000$ of the $k/100 - \ell \ge 9k/1000$ elements of $S \setminus T$. Assuming this happens, Lemma 23 now implies that each of the (at least) $k/1000$ "missed" elements (that are in $S \setminus T$ but not in $U$) contributes at least $35/k - 10/k = 25/k$ to the total variation distance between $f(p)$ and $f(\hat{p})$. (This is because each point in $(S \setminus T) \setminus U$ has probability at most $10/k$ under $f(\hat{p})$; the only way to get such an outcome $i$ from $f(\hat{p})$ is for the draw of $X$ from $\hat{p}$ to have $X_i = 1$, which occurs with probability at most $10/k$.) So in this case $d_{TV}(f(p), f(\hat{p}))$ is at least $25/1000 = 1/40$.

Case 2: At least $k/4$ of the (at least $499k/1000$) elements $i \in R$ have $\hat{p}_i \ge 10/k$. In this case, we have $f(\hat{p})(0) \le (1 - 10/k)^{k/4} < 1/10$. Since the target distribution $f(p)$ has $f(p)(0) = (1 - 100/k)^{k/100} \ge 0.35$, it follows that $d_{TV}(f(p), f(\hat{p})) \ge 1/4$ in this case. ■
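The numerical constants used in the two cases can be sanity-checked directly. The Python sketch below (with a hypothetical function name) evaluates the three quantities for a concrete large $k$; note that the bound $f(p)(0) \ge 0.35$ genuinely needs $k$ to be large, since for example $k = 1000$ gives $0.9^{10} \approx 0.349$.

```python
def case_bounds(k):
    # Case 2: upper bound on the hypothesis mass at outcome 0.
    hyp_zero_mass = (1 - 10 / k) ** (k / 4)
    # Mass that the target distribution f(p) puts on outcome 0.
    target_zero_mass = (1 - 100 / k) ** (k / 100)
    # Case 1: k/1000 missed elements, each contributing at least 25/k.
    case1_gap = (k / 1000) * (25 / k)
    return hyp_zero_mass, target_zero_mass, case1_gap

if __name__ == "__main__":
    hyp, target, gap = case_bounds(10_000)
    print(hyp, target, target - hyp, gap)
```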

E A Simple Proof of the DKW Inequality 



Recall the framework of the DKW inequality: $X$ is any random variable supported on $\{0, 1, \ldots, n\}$. $Z_1, \ldots, Z_k$ are independent copies of $X$, and $Z_i^{\ell}$ is defined to be $\mathbf{1}_{Z_i \le \ell}$, for all $\ell = 0, \ldots, n$ and $i = 1, \ldots, k$. Finally, define

$$\hat{F}_X(\ell) = \sum_{i=1}^{k} Z_i^{\ell} / k.$$
We prove: 

Theorem 7 (DKW Inequality) Let $k = \max\{576, (9/8)\ln(1/\delta)\} \cdot (1/\epsilon^2)$. Then with probability at least $1 - \delta$ we have

$$\max_{0 \le \ell \le n} |\hat{F}_X(\ell) - F_X(\ell)| \le \epsilon.$$
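To make the statement concrete, the following Python sketch computes the sample size from Theorem 7 and the statistic $\max_\ell |\hat{F}_X(\ell) - F_X(\ell)|$ that the theorem controls. The function names are hypothetical and the code is only meant to illustrate the definitions of the empirical CDF and the maximum deviation.

```python
import math

def dkw_sample_size(eps, delta):
    # k = max{576, (9/8) ln(1/delta)} / eps^2, rounded up to an integer.
    return math.ceil(max(576.0, 9.0 / 8.0 * math.log(1.0 / delta)) / eps ** 2)

def empirical_cdf(samples, n):
    # F_hat(l) = (number of samples Z_i with Z_i <= l) / k for l = 0, ..., n.
    k = len(samples)
    counts = [0] * (n + 1)
    for z in samples:
        counts[z] += 1
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / k)
    return cdf

def max_deviation(samples, true_cdf):
    # The quantity max_l |F_hat(l) - F(l)| bounded by the DKW inequality.
    n = len(true_cdf) - 1
    return max(abs(a - b) for a, b in zip(empirical_cdf(samples, n), true_cdf))
```

For instance, the samples $0, 1, 1, 2$ over $\{0, 1, 2\}$ give the empirical CDF $(0.25, 0.75, 1)$.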

We first prove the following weaker version of the inequality:

Theorem 25 Let $k = \frac{4}{\delta\epsilon^2}$. Then with probability at least $1 - \delta$ we have

$$\max_{0 \le \ell \le n} |\hat{F}_X(\ell) - F_X(\ell)| \le \epsilon.$$

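Proof of Theorem 25: Let $\hat{\ell}$ be the smallest index in $\{0, \ldots, n\}$ such that $\frac{1}{2} \le F_X(\hat{\ell})$. We first show the following.

Lemma 26 Let $k = \frac{1}{c\epsilon^2}$, where $c > 0$. Then

$$\Pr\left[\max_{0 \le \ell \le \hat{\ell}-1} |\hat{F}_X(\ell) - F_X(\ell)| > \epsilon\right] \le 2c.$$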



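Proof of Lemma 26: If $\hat{\ell} = 0$ the statement is vacuously true. If $\hat{\ell} > 0$, let us denote $\Delta_\ell := \hat{F}_X(\ell) - F_X(\ell)$, for $\ell = 0, \ldots, \hat{\ell} - 1$, and define

$$\Omega_\ell := \begin{cases} \dfrac{\Delta_\ell}{1 - F_X(\ell)}, & \text{if } \Delta_j \le \epsilon \text{ for all } 0 \le j \le \ell - 1 \\ \Omega_{\ell-1}, & \text{otherwise} \end{cases} \quad \text{for } \ell = 0, \ldots, \hat{\ell} - 1.$$

Also, define $\Omega_{-1} = \Omega^*_{-1} = 0$ and $\Omega^*_\ell := \frac{\Delta_\ell}{1 - F_X(\ell)}$, for $\ell = 0, \ldots, \hat{\ell} - 1$.

Claim 27 Both $\{\Omega^*_\ell\}_{\ell=-1,\ldots,\hat{\ell}-1}$ and $\{\Omega_\ell\}_{\ell=-1,\ldots,\hat{\ell}-1}$ are martingale sequences.

Proof: Clearly, $E[\Omega^*_0] = 0 = E[\Omega_0]$. Moreover, for $\ell = 1, \ldots, \hat{\ell} - 1$:

$$
\begin{aligned}
E\left[\Delta_\ell \,\middle|\, \Delta_{\ell-1} = \frac{m}{k} - F_X(\ell-1)\right]
&= \frac{m}{k} + \frac{k-m}{k}\Pr[Z_i \le \ell \mid Z_i \ge \ell] - F_X(\ell) \\
&= \frac{m}{k} + \frac{k-m}{k}\cdot\frac{F_X(\ell) - F_X(\ell-1)}{1 - F_X(\ell-1)} - F_X(\ell) \\
&= \frac{1 - F_X(\ell)}{1 - F_X(\ell-1)}\left(\frac{m}{k} - F_X(\ell-1)\right).
\end{aligned}
$$

It follows that the sequence $\{\Omega^*_\ell\}_{\ell=-1,\ldots,\hat{\ell}-1}$ is a martingale sequence. Hence, $\{\Omega_\ell\}_{\ell=-1,\ldots,\hat{\ell}-1}$ is also a martingale sequence. ■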
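The algebraic identity behind the martingale property can be verified with exact rational arithmetic. The sketch below uses hypothetical helper names and takes the conditional expectation in the form $\frac{m}{k} + \frac{k-m}{k}\cdot\frac{F_X(\ell) - F_X(\ell-1)}{1 - F_X(\ell-1)} - F_X(\ell)$, where $m$ is the number of samples that are at most $\ell - 1$; it checks that this equals $\frac{1 - F_X(\ell)}{1 - F_X(\ell-1)}\left(\frac{m}{k} - F_X(\ell-1)\right)$.

```python
from fractions import Fraction

def conditional_mean(m, k, F_prev, F_cur):
    # E[Delta_l | m of the k samples are <= l - 1]: the m samples stay below l,
    # and each remaining sample falls at l with probability
    # (F(l) - F(l-1)) / (1 - F(l-1)).
    a = Fraction(m, k)
    jump = (F_cur - F_prev) / (1 - F_prev)
    return a + (1 - a) * jump - F_cur

def rescaled_previous(m, k, F_prev, F_cur):
    # (1 - F(l)) / (1 - F(l-1)) times Delta_{l-1} = m/k - F(l-1).
    return (1 - F_cur) / (1 - F_prev) * (Fraction(m, k) - F_prev)

if __name__ == "__main__":
    F_prev, F_cur = Fraction(1, 5), Fraction(2, 5)
    for m in range(0, 11):
        assert conditional_mean(m, 10, F_prev, F_cur) == rescaled_previous(m, 10, F_prev, F_cur)
    print("identity holds for all m")
```

Dividing both sides by $1 - F_X(\ell)$ gives exactly the statement that the rescaled deviations form a martingale.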

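Note that

$$
\begin{aligned}
\sum_{\ell=0}^{\hat{\ell}-1} E\left[(\Omega_\ell - \Omega_{\ell-1})^2\right]
&= \sum_{\ell=0}^{\hat{\ell}-1} \left(E[\Omega_\ell^2] - E[\Omega_{\ell-1}^2]\right) - \sum_{\ell=0}^{\hat{\ell}-1} 2E\left[(\Omega_\ell - \Omega_{\ell-1})\Omega_{\ell-1}\right] \\
&= E[\Omega_{\hat{\ell}-1}^2] - E[\Omega_{-1}^2] \\
&= E[\Omega_{\hat{\ell}-1}^2],
\end{aligned}
$$

where in the second to last line we used the martingale property and in the last line we used that $\Omega_{-1} = 0$. The same analysis holds true for the martingale sequence $\Omega^*_\ell$ giving:

$$\sum_{\ell=0}^{\hat{\ell}-1} E\left[(\Omega^*_\ell - \Omega^*_{\ell-1})^2\right] = E\left[(\Omega^*_{\hat{\ell}-1})^2\right].$$

In particular, the above imply that $E[(\Omega_{\hat{\ell}-1})^2] \le E[(\Omega^*_{\hat{\ell}-1})^2]$. Now we have: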



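$$
\begin{aligned}
\Pr\left[\max_{0 \le \ell \le \hat{\ell}-1} \Delta_\ell > \epsilon\right]
&\le \Pr\left[\Omega_{\hat{\ell}-1} > \frac{\epsilon}{1 - F_X(0)}\right] \\
&\le \Pr\left[\Omega_{\hat{\ell}-1} > \epsilon\right] \\
&\le \frac{1}{\epsilon^2} E\left[\Omega_{\hat{\ell}-1}^2\right] \\
&\le \frac{1}{\epsilon^2} E\left[(\Omega^*_{\hat{\ell}-1})^2\right] \\
&= \frac{1}{\epsilon^2} \cdot \frac{1}{(1 - F_X(\hat{\ell}-1))^2} \cdot E\left[(\Delta_{\hat{\ell}-1})^2\right] \\
&\le \frac{4}{\epsilon^2} \cdot E\left[(\Delta_{\hat{\ell}-1})^2\right] \\
&= \frac{4}{\epsilon^2} \cdot \mathrm{Var}\left[\Delta_{\hat{\ell}-1}\right] \\
&= \frac{4}{\epsilon^2} \cdot \mathrm{Var}\left[\hat{F}_X(\hat{\ell}-1)\right] \\
&= \frac{4}{\epsilon^2} \cdot \frac{1}{k} F_X(\hat{\ell}-1) \cdot (1 - F_X(\hat{\ell}-1)) \le \frac{1}{\epsilon^2} \cdot \frac{1}{k}.
\end{aligned}
$$

Considering the random variables $\Delta'_\ell := -\Delta_\ell$ and repeating the analysis above, we can establish in a similar fashion that

$$\Pr\left[\min_{0 \le \ell \le \hat{\ell}-1} \Delta_\ell < -\epsilon\right] \le \frac{1}{\epsilon^2} \cdot \frac{1}{k}.$$

Combining the above together with a union bound completes the proof of the lemma. ■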

Now let us denote by $\hat{\ell}'$ the smallest index in $\{0, \ldots, n\}$ such that $1/2 \le 1 - F_X(n - \hat{\ell}' - 1)$. Applying Lemma 26 to the random variable $n - X$ we obtain the following.

Lemma 28 Let $k = \frac{1}{c\epsilon^2}$, where $c > 0$. Then

$$\Pr\left[\max_{n - \hat{\ell}' \le \ell \le n-1} |\hat{F}_X(\ell) - F_X(\ell)| > \epsilon\right] \le 2c.$$

Recall that $F_X(\hat{\ell}) \ge 1/2 \ge F_X(n - \hat{\ell}' - 1)$. Hence, $\hat{\ell} > n - \hat{\ell}' - 1$, which implies that $\hat{\ell} - 1 \ge n - \hat{\ell}' - 1$. This observation, a union bound and Lemmas 26 and 28 (applied for $c = \delta/4$) imply Theorem 25. ■

We are now ready to prove the actual DKW inequality. Let $M$ denote the random variable

$$\max_{0 \le \ell \le n} |\Delta_\ell| = \max_{0 \le \ell \le n} |\hat{F}_X(\ell) - F_X(\ell)|.$$

Since $M$ is defined by the outcomes of $Z_1, \ldots, Z_k$, we can write $M = g(Z_1, \ldots, Z_k)$, for some function $g : \mathbb{R}^k \to \mathbb{R}$. It is also easy to see that $g(Z_1, \ldots, Z_k)$ is $1/k$-Lipschitz as a function of its arguments $Z_1, \ldots, Z_k$. Since the $Z_i$'s are independent, we can apply McDiarmid's inequality to obtain

$$\Pr[M - E[M] > \epsilon] \le \exp(-2\epsilon^2 k).$$

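By repeated applications of Theorem 25 we show the following:

Claim 29 Let $k \ge \frac{256}{\epsilon^2}$. Then

$$E_{Z_1,\ldots,Z_k}[M] \le \frac{\epsilon}{2}.$$

Proof of Claim 29: Note that $M$ is supported in $[0, 1]$. We have

$$
\begin{aligned}
E[M] &\le \frac{\epsilon}{4} \cdot \Pr[0 \le M \le \epsilon/4] + \sum_{i=1}^{\lceil \log(4/\epsilon) \rceil} \frac{2^i \epsilon}{4} \cdot \Pr\left[\frac{2^{i-1}\epsilon}{4} < M \le \frac{2^i \epsilon}{4}\right] \\
&\le \frac{\epsilon}{4} + \sum_{i=1}^{\lceil \log(4/\epsilon) \rceil} \frac{2^i \epsilon}{4} \cdot \Pr\left[M > \frac{2^{i-1}\epsilon}{4}\right] \\
&\le \frac{\epsilon}{4} + \sum_{i=1}^{\lceil \log(4/\epsilon) \rceil} \frac{2^i \epsilon}{4} \cdot \frac{1}{2^{2i}} \\
&\le \frac{\epsilon}{4} + \sum_{i=1}^{\infty} \frac{\epsilon}{4 \cdot 2^i} = \frac{\epsilon}{2},
\end{aligned}
$$

where we used Theorem 25 for the third inequality. ■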

Note that Claim 29 imposes a lower bound on the sample complexity. From this claim and McDiarmid's inequality we get

$$\Pr[M > 3\epsilon/2] \le \Pr[M - E[M] > \epsilon] \le \exp(-2\epsilon^2 k).$$

Theorem 7 immediately follows by setting $2\epsilon/3$ in place of $\epsilon$. ■
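The final substitution can be checked numerically. In the sketch below (hypothetical function names), `tail_bound` is the McDiarmid bound $\exp(-2\epsilon^2 k)$ on $\Pr[M > 3\epsilon/2]$, and `samples_for_confidence` is the $(9/8)\ln(1/\delta)/\epsilon^2$ term from Theorem 7; replacing $\epsilon$ by $2\epsilon/3$ in the bound with this many samples gives exactly $\delta$.

```python
import math

def tail_bound(eps, k):
    # McDiarmid-based bound exp(-2 eps^2 k) on Pr[M > 3 eps / 2],
    # valid whenever E[M] <= eps / 2.
    return math.exp(-2.0 * eps ** 2 * k)

def samples_for_confidence(eps, delta):
    # The (9/8) ln(1/delta) / eps^2 term of the sample size in Theorem 7.
    return 9.0 / 8.0 * math.log(1.0 / delta) / eps ** 2

if __name__ == "__main__":
    eps, delta = 0.1, 0.05
    k = samples_for_confidence(eps, delta)
    # With 2*eps/3 in place of eps, the failure probability is delta.
    print(tail_bound(2.0 * eps / 3.0, k), delta)
```

The other term, $576/\epsilon^2$, plays a different role: it ensures that the expectation bound of Claim 29 still applies after the same rescaling of $\epsilon$.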

F Proof of Theorem 5: Learning sums of Bernoullis in poly(log(n), 1/ε) time with efficient hypotheses

In this section we observe that the probability theory literature already provides a non-proper algorithm (Birgé (1997)) that can be used as an alternative to our Theorem 4 for learning a sum of $n$ Bernoulli random variables. This algorithm has faster running time, requiring only $\mathrm{poly}(\log n, 1/\epsilon)$ time steps, but significantly worse sample complexity of $\log(n) \cdot \mathrm{poly}(1/\epsilon)$ samples (recall that Theorem 4 requires only $\mathrm{poly}(1/\epsilon)$ samples independent of $n$).

We say that a distribution $p$ over domain $\{0, 1, \ldots, n\}$ is unimodal if there exists some mode $M \in \{0, 1, \ldots, n\}$ (not necessarily unique) such that the probability density function (pdf) of $p$ is monotone nondecreasing on $\{0, \ldots, M\}$ and monotone nonincreasing on $\{M, \ldots, n\}$. Our starting point is the observation that any sum of $n$ Bernoullis is a unimodal distribution over $\{0, 1, \ldots, n\}$. (This can be shown by a straightforward induction on $n$, see e.g. Keilson and Gerber (1971).) This observation lets us use a powerful algorithm due to Birgé (1997) that can learn any unimodal distribution to accuracy $\epsilon$ in total variation distance using a number of samples that is optimal up to constant factors (Birgé (1987)). (Birgé's work deals with unimodal distributions over the continuous interval $[0, n]$, but it is easily modified to apply to our discrete setting.) His algorithm uses $O(\log n/\epsilon^3)$ samples and has overall running time of $O(\log^2 n/\epsilon^3)$ bit operations (note that this running time is best possible, given the sample complexity, since each sample is an $O(\log n)$ bit string). It outputs a hypothesis distribution that is a histogram over $O((\log n)/\epsilon)$ intervals that cover $\{0, \ldots, n\}$: more precisely, the hypothesis is uniform within each interval, and for each interval the total mass it assigns to the interval is simply the fraction of samples that landed in that interval. Thus the hypothesis distribution has a succinct description and can be efficiently evaluated using a small amount of randomness as claimed in Theorem 5.
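The shape of such a histogram hypothesis can be sketched in a few lines of Python. This is not Birgé's algorithm: the choice of the interval partition, which is the substantive part of his construction, is simply taken as an input here, and the function names are hypothetical. The sketch only shows how the hypothesis is stored (one mass per interval) and evaluated (uniformly within an interval).

```python
def fit_histogram(samples, intervals):
    # For each interval [lo, hi], store the fraction of samples landing in it.
    k = len(samples)
    return [(lo, hi, sum(1 for z in samples if lo <= z <= hi) / k)
            for lo, hi in intervals]

def evaluate(hist, point):
    # The hypothesis is uniform inside each interval, so the probability of a
    # point is the interval's mass divided by the interval's length.
    for lo, hi, mass in hist:
        if lo <= point <= hi:
            return mass / (hi - lo + 1)
    return 0.0

if __name__ == "__main__":
    hist = fit_histogram([0, 1, 1, 3], [(0, 1), (2, 5)])
    print([evaluate(hist, x) for x in range(6)])
```

Storing only one number per interval is what makes the description succinct: its size depends on the number of intervals rather than on $n$.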